meringlab / FlashWeave.jl

Inference of microbial interaction networks from large-scale heterogeneous abundance data

BIOM v2.1 from Qiime2 #20

Closed ARW-UBT closed 4 years ago

ARW-UBT commented 4 years ago

Hi, I am trying to analyse microbiome data by using biom data generated by the most recent Qiime2 pipeline (qiime2-2020.8) but I get an error which I could not resolve so far.

julia> netw_results = learn_network(data_path, meta_data_path, sensitive=true, heterogeneous=false)

Loading data

ERROR: MethodError: no method matching zero(::SubString{String})
Closest candidates are:
  zero(::Type{Missing}) at missing.jl:103
  zero(::Type{Dates.Time}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\Dates\src\types.jl:406
  zero(::Type{Dates.DateTime}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\Dates\src\types.jl:404
  ...

Does anybody have an idea what this could mean? I have attached the biom file created by Qiime2 (original name filtered_table.qza, renamed to allow upload).

filtered_table_gza.zip

jtackm commented 4 years ago

Hi, thanks for filing the issue. What type of file is meta_data_path referring to? FlashWeave currently doesn't read meta data directly from a .biom file, but requires a separate .tsv/.csv (see #10). If you extract the metadata from the .biom file (for instance using the python biom package) and save it to .tsv, this should work:

data_path = "feature-table.biom"
meta_data_path = "meta-data.tsv"
netw_results = learn_network(data_path, meta_data_path, sensitive=true, heterogeneous=false)

If you want to omit meta data information, just set meta_data_path = nothing.

However, the error message is currently very uninformative; I will improve that for the next version. Also, if I see there is demand, I will prioritize solving the linked issue above to make this work out of the box.

ARW-UBT commented 4 years ago

Hi, meta_data_path was referring to a metadata tsv file, which has various string-type metadata. On your Github page, I read that "Meta variables containing string factors with more than two categories are automatically one-hot encoded by FlashWeave prior to network inference". So I assume this should be done automatically. I attach this file (zipped) to this message.

The biom file does not contain metadata; 'biom export-metadata' reports that there are no sample metadata and no observation metadata in the biom file. Qiime2 is not perfectly transparent at these steps about what is included in the biom data and what is not. But I will inspect further whether there is a possibility to add the metadata to the biom output.

Biofilm-Bact2-metadata-FlashWeave.zip

ARW-UBT commented 4 years ago

I just want to add that a network could be generated from the biom data when omitting meta data. Do you have any suggestion how to modify the metadata.tsv content to include it in the analysis?

Which app do you recommend for graph visualization in Cytoscape?

jtackm commented 4 years ago

Sorry for the delay, I'm currently traveling. I see, the way you are using the software should definitely work. I tracked down the problem: FlashWeave was trying to combine sparse data from your .biom file with non-sparsifiable data from your meta data table. I made a patch, so your use case should now work on master. You can get the latest changes via ] + add FlashWeave#master. Could you try that and let me know if there are still issues? Thanks again for the report, this really helps with fixing use cases I don't run into myself.

Regarding your second question, I typically just use Cytoscape without special plugins. In my experience the default capabilities are flexible enough. Typical things I do include coloring nodes / edges using outside information, emphasising node centralities or edge strengths via size and thickness, or laying out the network in various ways (depending on your question).

ARW-UBT commented 4 years ago

Hi! Thank you for the patch, it worked without any error. I was able to export a raw PNG graph, and I am currently trying to find out how to correctly import the edgelist or gml network into Cytoscape. Do you recommend one over the other? If there is any tutorial out there, I would really appreciate it. Best regards!

ARW-UBT commented 4 years ago

Sorry, closed it by mistake...

jtackm commented 4 years ago

I learned most of it by playing with the software and occasional googling, so unfortunately can't suggest a good tutorial off the top of my head, but I would be surprised if there aren't decent ones around. Alternatively you can also go the less interactive route via various python/R packages like networkx, graphviz and igraph, whatever feels most intuitive to you.

ARW-UBT commented 4 years ago

Yes, I will do the same. There are a few tutorials, and Cytoscape itself comes with sample data (including gml, but not edgelist). Actually, I realized that the FW-exported edgelist does not load in Cytoscape, although they list 'edgelist' (extension .el) in the import dialog. I am not sure whether the FW and Cytoscape formats are the same; the Cytoscape documentation does not describe edgelists.

The FW .gml does load (over 3000 nodes and edges in a node table and edge table) and I will have to dig into the style options to format the metadata information (is this exported to the gml file?).

I also tried to work with tsv/csv tables as described in the FW help:

help?> save_network
search: save_network

save_network(net_path::AbstractString, net_result::FWResult) -> Void

Save network results to disk. Available formats are '.tsv', '.csv', '.gml' and '.jld2'.

However, FW julia tells me that tsv/csv are not valid output formats:

ERROR: .tsv not a valid output format. Choose one of (".edgelist", ".gml", ".jld2")

Thank you for your help and for FlashWeave! Best regards,

jtackm commented 4 years ago

Sorry for the confusion about .tsv/.csv vs edgelist, the help string for FlashWeave save_network / load_network was outdated. FlashWeave currently only supports .edgelist, .gml and .jld2 (as the error message suggests). I have now fixed the help string thanks to your report.

If you want to load .edgelist files in Cytoscape, you can go to File -> Import -> Network from File.... Then, under Advanced Options, you have to specify that lines starting with the character # should be ignored, and select the source/target columns for your edges. I agree this is a bit involved if you are not familiar with Cytoscape; I will hopefully find time to write more detailed documentation on this and other workflow-related issues in the future. In any case, the result should be identical to loading from a .gml file, so please let me know if this is not the case.
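For anyone who wants to inspect the file outside Cytoscape first, the same import logic can be sketched in Python. This is only a minimal sketch, not FlashWeave's own reader: it assumes, based solely on the description above, that the .edgelist file is tab-separated with source, target and weight in the first three columns, and that lines starting with # carry header information to be skipped. The function name read_edgelist and the sample content are hypothetical.

```python
import csv
import io


def read_edgelist(handle):
    """Return (source, target, weight) tuples, skipping '#' lines and blank lines."""
    edges = []
    for row in csv.reader(handle, delimiter="\t"):
        # Mirror the Cytoscape option "ignore lines starting with #".
        if not row or row[0].startswith("#"):
            continue
        source, target, weight = row[0], row[1], float(row[2])
        edges.append((source, target, weight))
    return edges


# Hypothetical file content, for illustration only.
sample = io.StringIO(
    "# this line is ignored, as in the Cytoscape import dialog\n"
    "OTU_1\tOTU_2\t0.42\n"
    "Type-Month_Ker-Aug\tOTU_1\t-0.31\n"
)
edges = read_edgelist(sample)
```

With a real file, the same function could be called as read_edgelist(open("network.edgelist", newline="")), where the file name is again just a placeholder.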

ARW-UBT commented 4 years ago

Thank you for the hint on the advanced options for edgelist data. The network is displayed now. What I found at the end of the edgelist file is a part that obviously contains the metadata information. It seems to me that they become part of the node table in Cytoscape (see screenshot). Would you recommend constructing the network in FlashWeave without metadata and adding them to Cytoscape later? In FW, the order of abundance data matches the order of metadata, right? But I believe that this matching information is no longer present in the edgelist file. Maybe I would be quite close to a Cytoscape network with metadata info if I knew how to use the metadata section of the edgelist file.

Thank you again for your patience! Cytoscale_FW-Metadata

jtackm commented 4 years ago

What I found at the end of the edgelist file is a part that obviously contains the metadata information. It seems to me that they become part of the node table in Cytoscape (see screenshot).

The meta data related information at the end of the Node table refers to one-hot encoded meta variables. To improve statistical power and interpretability, FlashWeave splits variables with more than two categories, like your Type-Month column, into separate dummy variables, each of which represents a different category from the original column. If you for instance find an edge (in Cytoscape under "Edge Table") from Type-Month_Ker-Aug to an OTU, it means that the abundance of that OTU is directly associated with the Ker-Aug category.
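To make the naming of these dummy variables concrete, here is a small Python sketch of one-hot encoding. It is an illustration of the general technique, not FlashWeave's actual implementation. The function one_hot_encode and the sample values are hypothetical; only the column name Type-Month and the category Ker-Aug come from this thread, and the column_category naming is inferred from the node name Type-Month_Ker-Aug.

```python
def one_hot_encode(column_name, values):
    """Split a categorical column into one 0/1 dummy column per category."""
    categories = sorted(set(values))
    return {
        f"{column_name}_{category}": [int(v == category) for v in values]
        for category in categories
    }


# Hypothetical per-sample annotations; only "Ker-Aug" is taken from the thread.
type_month = ["Ker-Aug", "Ker-Sep", "Lab-Aug", "Ker-Aug"]
dummies = one_hot_encode("Type-Month", type_month)

for name, column in dummies.items():
    print(name, column)
```

Each dummy column then acts as its own node in the network, which is why several Type-Month_... entries can show up in the Node table for a single input column.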

Would you recommend to construct the network in FlashWeave without metadata and add them to Cytoscape later?

I typically keep all meta variables initially and then explore the network to decide on useful visualizations. If meta variables turn out not to be interesting for visualization, one can always make them invisible (the mv column in the Node table tells you which ones are meta variables; this can be used for filtering). In any case, FlashWeave makes use of meta information to improve edge detection. So even if you are not interested in MV-OTU relationships, the OTU-OTU relationships also benefit from including meta variable information.

In FW, the order of abundance data matches the order of metadata, right? But I believe that this matching information is no longer present in the edgelist file.

I'm not sure what you mean by this. FlashWeave generally tries to keep variable ordering (both OTU and meta data) aligned with the input. However, numerical indices may change if for instance variables are removed during the filtering and normalization process or if one-hot encoding takes place. In any case, the variable names you see depicted in the Node table and Edge table always refer to the corresponding variable columns in the input file.

jtackm commented 4 years ago

Gonna close this for now, feel free to open new issues if you run into other FlashWeave-related errors.

ARW-UBT commented 3 years ago

Hi, I assume that my last post has not been saved. Sorry that I have to come back again. After importing the edgelist data into Cytoscape (ignoring lines with #) I end up with the network shown in my last post. If I move down to the end of the edgelist table, I find the MV data in the last lines, but there is no mv column in the Node Table. How can I specify the metadata during the import of the edgelist file? I attach the edgelist file that was exported from FlashWeave, and I would very much appreciate it if you could have a look at it.

Thanks! BioTest1.zip

jtackm commented 3 years ago

Yes, sorry, I forgot to mention that the mv column is only present when loading .gml files. The problem with the .edgelist format is that it's poorly specified, and while Cytoscape would be okay with having the mv column there, some tools will choke if there is too much information per edge. For the time being, the easiest option would be sticking to the .gml format for your use case. Alternatively, you could also add an mv column yourself, either as a tab-separated column in the .edgelist file (by comparing node identifiers in python/R/etc), or it might also be possible to add a column to the Node table directly within Cytoscape.
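The "comparing node identifiers" route could look roughly like the following Python sketch. It is written under assumptions rather than against FlashWeave's specification: edge rows are taken to be tab-separated with source and target in the first two columns, lines starting with # are skipped, and the set of meta variable names has to be supplied by the user (for example from the metadata column names after one-hot encoding). The functions build_node_table and write_node_table and all sample names are hypothetical. Instead of modifying the .edgelist itself, the sketch writes a separate node table with an mv column.

```python
import csv
import io


def build_node_table(edgelist_lines, meta_variable_names):
    """Collect node names from edge rows and flag which ones are meta variables."""
    nodes = []
    seen = set()
    for line in edgelist_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip header/comment lines and blanks
        source, target = line.split("\t")[:2]
        for name in (source, target):
            if name not in seen:
                seen.add(name)
                nodes.append(name)
    return [(name, name in meta_variable_names) for name in nodes]


def write_node_table(rows, handle):
    """Write a tab-separated node table with columns 'name' and 'mv'."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["name", "mv"])
    for name, is_mv in rows:
        writer.writerow([name, str(is_mv).lower()])


# Hypothetical edge rows and meta variable names, for illustration only.
lines = [
    "# comment line\n",
    "OTU_1\tOTU_2\t0.42\n",
    "Type-Month_Ker-Aug\tOTU_1\t-0.31\n",
]
rows = build_node_table(lines, {"Type-Month_Ker-Aug"})
out = io.StringIO()
write_node_table(rows, out)
node_table_text = out.getvalue()
```

A table like this should be importable into an existing Cytoscape network via File -> Import -> Table from File..., matching on the node name column, although I have not verified this for FlashWeave exports specifically.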

ARW-UBT commented 3 years ago

OK, that explains everything. Sorry, I am new to Cytoscape networks, and would ask you a final favor: I cannot see any network in the gml FlashWeave export, but I could see it in the edgelist. There are no import dialogs for gml data, so I don't know what might be wrong. Could you check the gml file (attached) and briefly explain what you did to display the network? Thanks BioTest1gml.zip

jtackm commented 3 years ago

Yes, it just stacked all network nodes on top of each other instead of computing the default network layout. Not sure why that is; perhaps it couldn't automatically infer which column corresponds to edge weights due to the additional mv column. In any case, you can compute your own layouts via the Layout menu at the top. Fruchterman-Reingold gives good results in my experience, but is also quite slow. It is generally a good idea to try different layouts depending on your research question.