grp-bork / spire_contribute


Linking MAGs to representative IDs #1

Closed wwood closed 10 months ago

wwood commented 10 months ago

Hello there.

Great paper - so much work.

I'm trying to make use of these data, but I've hit a problem: I don't think it is possible to link the MAG FASTA file names to the MAG IDs used in the metadata.

I tried to download them on a per-study basis. This seems to require some parsing of HTML, but not too bad. However, it seems that the file names in the MAGs download (e.g. SAMN15803490.psa_megahit.psb_metabat2.00001.fa.gz) can't be linked with IDs as they are present in the metadata (e.g. spire_mag_00000001).

Is there a table of conversions somewhere? I would really rather not go through a lengthy task of dereplication if this is already completed.
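To illustrate what such a table would enable, here is a minimal sketch. It assumes a hypothetical two-column, tab-separated table (MAG file name, then SPIRE MAG ID) and a plain list of downloaded file names, one per line. Neither layout is confirmed by SPIRE, and `link_mags` is just an illustrative name.

```shell
# Print "<mag_file><TAB><spire_id>" for every listed MAG file found in the table.
# $1: conversion table (ASSUMED layout: file name <TAB> SPIRE MAG ID)
# $2: list of MAG file names, one per line
link_mags() {
    awk -F '\t' '
        NR == FNR { id[$1] = $2; next }       # first file: load the table
        $1 in id  { print $1 "\t" id[$1] }    # second file: emit matches only
    ' "$1" "$2"
}

# Example (file names are placeholders):
#   link_mags conversion_table.tsv downloaded_mags.txt > linked.tsv
```

File names missing from the table are silently skipped, so comparing line counts between the input list and the output shows how many MAGs could not be linked.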

Ideally, if there were a download of the dereplicated genomes somewhere that would be even better, so there's no need to jump through API hoops.

Thanks, ben

fullama commented 10 months ago

Well, that's a mistake. Those IDs are supposed to be converted; I'll fix that and get back to you. (I could send you a conversion table, but maybe it's easier for me to fix the downloads?)

> This seems to require some parsing of HTML, but not too bad.

Does this mean you wanted to download all studies? I could make a table available with the URL for every study?

> download of the dereplicated genomes

Do you mean a download of representative genomes of the clusters?

wwood commented 10 months ago

Thanks for the quick and helpful response.

> (I could send you a conversion table, but maybe it's easier for me to fix the downloads?)

Fix the tar downloads? I've already downloaded most (some failed due to server errors - do you want a list of these?). Up to you which way is easiest.

I've been grabbing the links by grepping the HTML files

grep download_link -h ../study_pages/* |grep _MAG |sed 's/.*https/https/; s/".*//' |notify parallel -j1 --ungroup wget {}

A URL would save the regex on the HTML.

> Do you mean a download of representative genomes of the clusters?

Yes - all I need is these (and this is probably true for others too I'd bet). I'm downloading everything now, just to then pick the reps out.

fullama commented 10 months ago

Ah, well then I will add a representative set to the download section. I will also fix the tar downloads and add some kind of table to make it easier to get all the URLs. (I'll try to get the reps up either today or tomorrow.)

If you did have a list of studies that failed, that would be really helpful (they obviously shouldn't fail :) and I should find out why they do).

wwood commented 10 months ago

Thanks for making that download - will be very helpful. Re the issues I encountered:

For some of the study pages there's an internal error e.g. https://spire.embl.de/study/Liu_2021_buffalo?page=1

Here is a list of ones that failed for me (maybe intermittently..):

bioGEOTRACERS_marine_pelagic FMT_Lee JGI_ant_fungus_Panama JGI_aquatic_extreme_virus JGI_coalbed JGI_estuary_Chesapeake JGI_estuary_SFBD JGI_groundwater_Europe JGI_saline_lake_Antarctica JGI_saltmarsh_Skidaway JGI_wastewater_Wisconsin2 Liu_2021_buffalo MetaHIT MyMicrobes_public Patagonia_fjords PRJDB7630_cardiovascular_diseases PRJDB8987_macaques PRJEB45799_Segata_transmission PRJNA356291_animal_cow PRJNA400853_freshwater PRJNA415974_wastewater PRJNA436562 PRJNA479838_CRC PRJNA543206_chicken PRJNA565546_elderly_osteoporosis PRJNA699281_elderly PRJNA723432_goat studies-20_PRJEB23957 studies-20_PRJNA309119 studies-20_PRJNA390775 studies-20_PRJNA436990 studies-20_PRJNA489143 studies-20_PRJNA532676 studies-20_PRJNA564649
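A minimal sketch for re-checking pages like these. It assumes that the URL pattern of the example link above holds for every study name and that `curl` is installed. `study_url` and `check_studies` are illustrative helper names, not part of SPIRE.

```shell
# Build the study page URL (pattern ASSUMED from the example link above).
study_url() {
    printf 'https://spire.embl.de/study/%s?page=1\n' "$1"
}

# Print "<http_status><TAB><study>" for each study name given as an argument.
# curl reports 000 when no HTTP response was received at all.
check_studies() {
    for study in "$@"; do
        status=$(curl -s -o /dev/null -w '%{http_code}' "$(study_url "$study")")
        printf '%s\t%s\n' "$status" "$study"
    done
}

# Example:
#   check_studies Liu_2021_buffalo MetaHIT PRJNA436562
```

Running it a few times and comparing the status columns would help separate intermittent failures from pages that fail consistently.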

fullama commented 10 months ago

Hi, so it took a bit longer than I thought to pull all the files together. I will update the website tomorrow, but if you want it now, the representative set of MAGs can be found here: https://swifter.embl.de/~fullam/spire/representatives/spire_representative_genomes.tar

Also, the MAG download links should now be fixed with the correct IDs. Regarding the list of failed studies: I did a bunch of renaming before the paper came out. I didn't set up redirects because I didn't think anyone would have saved any bookmarks yet. I could set up redirects from the old names to the new names if you need them?

wwood commented 10 months ago

Great, downloading now. Thanks a lot.

Re the failed studies: I got these after publication by going through the spire_v1_microntology.tsv.gz file from the downloads section. I didn't "bookmark" them in Firefox or anything, if that's what you mean.

It matters not to me whether they redirect; from my perspective it might just be useful for others if the IDs were consistent between the download files, that's all.

I'll leave it to you to decide on whether to close this issue - I have what I need. Thanks again.

fullama commented 10 months ago

I have hopefully fixed the spire_v1_microntology.tsv.gz file, so there should be no more study names in there that are not on the site. Thanks for that; I had completely missed that there were study names in that file.