Soil contig tax - GitHub Issues

brynnz22 commented 6 months ago

Added a Python notebook to look at how the taxonomic distribution of contigs differs by soil layer (mineral vs. organic) in Colorado. This uses NMDC metadata to access and analyze metagenome data.

kheal commented 6 months ago

@brynnz22 Overall, an impressive feat to pull together all these API calls and merges in a consumable way. Wow! The notebook rendered fine without running the calls or accessing the pkl files, so I think we are good on that front. We should, though, add a note to the README for the Google Colab that rerunning this in an interactive environment is not recommended and will likely break for these reasons.

Do you use the 'taxonomic_dist_by_soil_layer/python/mongodb_query.txt.js' file? Maybe I'm missing it, but if not, please remove it from the branch.

In the first Markdown cell, the word 'object' is misspelled.

Once we have the TSV URLs, I think it would be useful to show a single sample's results before concatenating them all together. That dataframe should have soil horizon, biosample id, geo_loc, taxa, and count.
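
As a rough illustration of that single-sample view, here is a minimal sketch. It assumes a header-less, three-column TSV (contig identifier, taxonomic placement, count); the function name `single_sample_table`, the example rows, and the metadata values are all hypothetical and are not taken from the notebook or from NMDC.

```python
import io

import pandas as pd

# Hypothetical TSV content for one sample: contig id, taxonomic placement, count.
example_tsv = (
    "contig_1\tBacteria;Proteobacteria\t3\n"
    "contig_2\tBacteria;Acidobacteria\t1\n"
    "contig_3\tBacteria;Proteobacteria\t2\n"
)


def single_sample_table(tsv_text, biosample_id, soil_horizon, geo_loc):
    """Build a per-taxon table for one sample and attach its metadata."""
    contigs = pd.read_csv(
        io.StringIO(tsv_text),
        sep="\t",
        header=None,
        names=["contig_id", "taxa", "count"],
    )
    # Collapse contigs to one row per taxon within this single sample.
    table = contigs.groupby("taxa", as_index=False)["count"].sum()
    # Prepend the metadata columns so every row is traceable to its sample.
    table.insert(0, "geo_loc", geo_loc)
    table.insert(0, "biosample_id", biosample_id)
    table.insert(0, "soil_horizon", soil_horizon)
    return table


# Placeholder metadata values, for illustration only.
sample_table = single_sample_table(example_tsv, "sample-A", "organic", "Colorado, USA")
print(sample_table)
```

Tables built this way for each sample could then be concatenated with `pd.concat`, with every row still carrying the metadata of the sample it came from.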

Biologically, it's not correct to add together counts between samples, so I think we need to revisit the last couple of code chunks so they make a bit more sense. I have a couple of ideas for this that shouldn't be too painful (hopefully!).

brynnz22 commented 6 months ago

@kheal addressing your points above, I:

edited the readme to explain that running in the interactive environment is not recommended
We should keep the mongodb_query.txt.js file because it helped inform the API request traversals. It is also helpful for informing the endpoint being created.
I fixed the misspelling of object
I printed a snippet of a TSV to show what it looks like
Finally, we discussed the last point and agreed that the way I did it was correct.

I also created a second plot faceted by locations in Colorado

Thanks for the feedback :)

kheal commented 6 months ago

In md cell 35, "Example of what the TSV contig taxa file looks like", we decided the third column is not a percent (otherwise it would add up to > 100%), so that text should be edited. How about something along the lines of "The first column is the identifier of a single contig, the second is the taxonomic placement of the contig, and the third is a simple count"? In py cell 36, I would rename the percent column to count. Also, we will need to calculate relative abundance per sample per taxon, and then calculate the average relative abundance per horizon, as we discussed.
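
The relative abundance calculation could look something like the sketch below. It assumes a long-format dataframe with the columns `soil_horizon`, `biosample_id`, `taxa`, and `count`; the function name `mean_relative_abundance` and the toy numbers are hypothetical, and this is not the notebook's actual code.

```python
import pandas as pd


def mean_relative_abundance(df: pd.DataFrame) -> pd.DataFrame:
    """Average the per-sample relative abundance of each taxon within each horizon."""
    # Sum counts per sample and taxon in case a taxon appears on several rows.
    per_sample = df.groupby(
        ["soil_horizon", "biosample_id", "taxa"], as_index=False
    )["count"].sum()
    # Relative abundance: each taxon's share of its own sample's total count.
    totals = per_sample.groupby("biosample_id")["count"].transform("sum")
    per_sample["rel_abundance"] = per_sample["count"] / totals
    # Pivot to samples x taxa so a taxon absent from a sample contributes 0.
    wide = per_sample.pivot_table(
        index=["soil_horizon", "biosample_id"],
        columns="taxa",
        values="rel_abundance",
        fill_value=0.0,
    )
    # Average across samples within each horizon.
    return wide.groupby(level="soil_horizon").mean()


# Toy data for illustration only.
toy = pd.DataFrame(
    {
        "soil_horizon": ["organic", "organic", "organic", "organic", "mineral", "mineral"],
        "biosample_id": ["s1", "s1", "s2", "s2", "s3", "s3"],
        "taxa": ["A", "B", "A", "B", "A", "C"],
        "count": [30, 10, 10, 10, 5, 15],
    }
)
result = mean_relative_abundance(toy)
print(result)
```

In the toy data, taxon A averages 0.625 in the organic horizon (0.75 in s1 and 0.5 in s2), whereas pooling raw counts across the two samples would give 40/60, about 0.667. Normalizing within each sample first keeps a sample with more contigs from dominating the horizon average.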

brynnz22 commented 6 months ago

Okay! I think we are good! Thanks again for all of your help!!

brynnz22 commented 6 months ago

@kheal I believe I got the nbviewer to work now: https://nbviewer.org/github/microbiomedata/notebook_hackathons/blob/soil-contig-tax/taxonomic_dist_by_soil_layer/python/taxonomic_dist_soil_layer.ipynb

kheal commented 6 months ago

closes #8!

microbiomedata / nmdc_notebooks

Soil contig tax #24