snayfach / IGGdb

Database of genomes integrated from the gut microbiome and other environments
GNU General Public License v3.0

IGG genomes #6

Open YiJessePi opened 4 years ago

YiJessePi commented 4 years ago

Can you please provide more information about the 206,581 IGG genomes that were clustered into 23,790 representative genomes? I understand that they include samples from the HGM, PATRIC and IMG datasets. Can you tell me which genomes were downloaded from PATRIC and IMG, and when? Were all of them reconstructed from the human gut? Many thanks!

snayfach commented 4 years ago

The HGM dataset contains MAGs from human gut metagenomes. The PATRIC and IMG datasets are mainly isolate genomes, both from gut and non-gut environments. You'll have to refer to the publication for exact #s: https://www.nature.com/articles/s41586-019-1058-x

There should be a metadata file that contains information on all 23K OTUs, including which are found in the human gut.

YiJessePi commented 4 years ago

Thanks for the prompt response! I'm interested in the hosts of the genomes. I looked at Table S10 in the paper you referred to, and besides human-associated genomes I found host-associated and gut-associated ones (which I assume are non-human). Do you know how I can find the host data, or anything about the host distribution in your database (something like x% human, y% mouse, etc.)? Thanks again!

snayfach commented 4 years ago

Unfortunately I don't have that data off hand. The best way would be to look at the genome_metadata file on the PATRIC FTP site: ftp://ftp.patricbrc.org/RELEASE_NOTES/. I think they have a field indicating the host for a subset of the genomes.
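A minimal sketch of how a host breakdown could be tallied once the genome_metadata file has been downloaded. It assumes the file is tab-delimited with a header row and that the host is stored in a column named `host_name`; both are assumptions to check against the actual file, and the rows below are made-up stand-ins rather than real PATRIC records.

```python
import csv
import io
from collections import Counter


def host_distribution(handle, host_column="host_name"):
    """Return {host: percent} over rows that have a non-empty host field.

    `host_column` is an assumed column name; adjust it to match the file.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter()
    for row in reader:
        host = (row.get(host_column) or "").strip()
        if host:  # genomes without host annotation are skipped
            counts[host] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {host: 100.0 * n / total for host, n in counts.most_common()}


# Made-up rows standing in for the real metadata file.
example = io.StringIO(
    "genome_id\tgenome_name\thost_name\n"
    "1.1\tgenome_a\tHuman, Homo sapiens\n"
    "2.1\tgenome_b\tHuman, Homo sapiens\n"
    "3.1\tgenome_c\tMouse, Mus musculus\n"
    "4.1\tgenome_d\t\n"
)
dist = host_distribution(example)
for host, pct in dist.items():
    print(f"{host}\t{pct:.1f}%")
```

For the real file, pass an open file handle instead of the in-memory example, and restrict the rows to the IGG genome IDs first if only those are of interest.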

adityabandla commented 4 years ago

Is there any documentation or code for harvesting the 200,000-odd genomes from PATRIC and IMG? I am trying to build the database needed to use the conspecific module of MAGpurify.

snayfach commented 4 years ago

For PATRIC, the genomes can be downloaded from ftp://ftp.patricbrc.org. These should cover the majority of genomes in IMG. Let me know if you need additional help or pointers.
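As a rough sketch of scripting the PATRIC downloads, the helper below only builds one URL per genome ID. The `genomes/<genome_id>/<genome_id>.fna` layout is an assumption that should be verified against the FTP listing, and the two IDs are illustrative only.

```python
def patric_fna_url(genome_id, base="ftp://ftp.patricbrc.org/genomes"):
    """Build the assumed FTP URL of a genome's FASTA file.

    `genome_id` must be a string so that IDs such as "123.10" keep
    their trailing zero.
    """
    if not isinstance(genome_id, str):
        raise TypeError("genome_id must be a string, e.g. '511145.12'")
    return f"{base}/{genome_id}/{genome_id}.fna"


# Illustrative IDs; in practice they would be read from a genome list.
genome_ids = ["511145.12", "83332.12"]
urls = [patric_fna_url(g) for g in genome_ids]
for url in urls:
    print(url)
```

The resulting list can be handed to any download tool; keeping URL construction separate from the transfer makes it easy to retry failed genomes.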

adityabandla commented 4 years ago

Thanks, Stephen! I was able to get the genome downloads going from PATRIC. However, I would also like to gather genomes from IMG to capture those described in https://www.nature.com/articles/s41588-017-0012-9

snayfach commented 4 years ago

OK - I'll see if I can put together a download for you. This would be the exact set of nr genomes from PATRIC/IMG used in my publication. Let me know if this works for you.

adityabandla commented 4 years ago

Stephen, that'd be great!

snayfach commented 4 years ago

I've sent you a link. Please let me know if you have any trouble with it. You will likely need about 170 GB to download the tarball and close to 1 TB to unpack it.
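A quick way to check whether a target directory has enough room before starting, sketched with the standard library. The 170 GB and 1 TB figures come from the comment above; treating them as decimal gigabytes and checking the current directory are assumptions.

```python
import shutil


def has_room(path, needed_gb):
    """Return True if `path` has at least `needed_gb` decimal GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= needed_gb * 10**9


# "." stands in for wherever the tarball will be stored and unpacked.
ok_download = has_room(".", 170)
ok_unpack = has_room(".", 1000)
print(f"room to download: {ok_download}, room to unpack: {ok_unpack}")
```

If the tarball and the unpacked genomes live on the same volume, the two requirements add up rather than overlap.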

adityabandla commented 4 years ago

@snayfach Thanks! Can you please share it with abandla@nus.edu.sg which links to my business Dropbox account?

adityabandla commented 4 years ago

@snayfach Stephen, thanks for making the nr genomes available. We were trying to build sketches for this set, but ran into a few issues. First, in Supplementary Table 9 of the paper, there seem to be a number of duplicate genome IDs. Second, although the number of genomes in the Excel file matches the number in the dataset you shared, about 8k genomes in the dataset do not have corresponding IDs in the Excel file.

snayfach commented 4 years ago

The duplicate IDs in Table S9 are due to an Excel formatting error that stripped trailing zeros. Please find the mapping between genome_id and genome_name in the attached document.

There should be a 1:1 mapping between genomes in table S9 and the shared dataset. If not - please provide an example for me to look into. Hopefully the updated table of ids will resolve that issue.

genome_name_to_id.txt
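The trailing-zero problem is easy to reproduce: two distinct IDs collapse into one as soon as they are parsed as numbers. The sketch below uses made-up IDs and a hypothetical two-column, tab-delimited layout; the point is only that ID columns should be read as text.

```python
import csv
import io

# Made-up IDs that differ only by a trailing zero.
table = io.StringIO(
    "genome_id\tgenome_name\n"
    "1234.1\tgenome_a\n"
    "1234.10\tgenome_b\n"
)

rows = list(csv.DictReader(table, delimiter="\t"))

as_text = [row["genome_id"] for row in rows]           # kept as strings
as_number = [float(row["genome_id"]) for row in rows]  # what a spreadsheet does

text_unique = len(set(as_text))      # both IDs stay distinct
number_unique = len(set(as_number))  # 1234.1 and 1234.10 become equal
print(text_unique, number_unique)
```

The `csv` module never converts values, so IDs read this way stay intact; in a spreadsheet, the column has to be imported as text to get the same behaviour.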

adityabandla commented 4 years ago

We first tried removing the genome IDs from the Excel file and then merging the Excel and txt files on the genome_name column, but duplicate genome names seem to be an issue.

We tried this again on just the subset of duplicated IDs, but there still seems to be an issue with the names.
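One way to see why a merge on genome_name misbehaves is to list the names that occur more than once before joining. This is a minimal sketch with hypothetical records and field names, not the actual tables from this thread; it refuses to merge when the key is not unique on either side.

```python
from collections import Counter


def duplicated_keys(records, key):
    """Return {value: count} for key values that appear more than once."""
    counts = Counter(record[key] for record in records)
    return {value: n for value, n in counts.items() if n > 1}


def merge_one_to_one(left, right, key):
    """Join two lists of dicts on `key`, failing if the key is not unique."""
    for side, records in (("left", left), ("right", right)):
        dups = duplicated_keys(records, key)
        if dups:
            raise ValueError(f"{side} table has duplicated {key}: {sorted(dups)}")
    lookup = {record[key]: record for record in right}
    return [{**rec, **lookup[rec[key]]} for rec in left if rec[key] in lookup]


# Hypothetical records: the name "genome_a" is shared by two genomes.
names_to_ids = [
    {"genome_name": "genome_a", "genome_id": "1234.1"},
    {"genome_name": "genome_a", "genome_id": "1234.10"},
    {"genome_name": "genome_b", "genome_id": "5678.3"},
]
dups = duplicated_keys(names_to_ids, "genome_name")
print(dups)
```

When names are not unique, the join needs a key that is, such as the corrected genome_id.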

snayfach commented 4 years ago

I've uploaded the (nearly) complete version of table S9 with fixed ids. Genome quality level is not included, but that's just based on the completeness, contamination, N50, and # of contigs, which are all included. I also didn't include the clustered genome ids - so let me know if that's needed.

Table_S9.txt

adityabandla commented 4 years ago

Thanks, Stephen, for looking into this and helping us out. It would be great to have the clustered genome IDs if possible, since we are also trying to replicate the database from scratch.

snayfach commented 4 years ago

Please see the attached file for the reference genomes clustered at a Mash distance of 0.0.

clustered_genomes.txt
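For anyone rebuilding such clusters from scratch, grouping genomes at a distance cutoff can be sketched as single-linkage clustering over pairwise distances. The thread does not say which clustering procedure was used, so the single-linkage choice, the input format, and the genome names below are all assumptions for illustration.

```python
def cluster_by_distance(genomes, pairs, max_dist=0.0):
    """Single-linkage clustering: link genomes whose distance <= max_dist.

    `pairs` is an iterable of (genome_a, genome_b, distance) tuples.
    Returns a sorted list of sorted clusters.
    """
    parent = {g: g for g in genomes}

    def find(g):
        # Follow parent links to the cluster root, shortening the path.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b, dist in pairs:
        if dist <= max_dist:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # Keep the lexicographically smaller root as representative.
                parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters = {}
    for g in genomes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in clusters.values())


# Made-up genomes and distances.
genomes = ["g1", "g2", "g3", "g4"]
pairs = [("g1", "g2", 0.0), ("g2", "g3", 0.0), ("g3", "g4", 0.02)]
clusters = cluster_by_distance(genomes, pairs, max_dist=0.0)
print(clusters)
```

With a cutoff of 0.0 only genomes at zero estimated distance are merged, so g4 stays in its own cluster.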

ashwinssudarshan commented 4 years ago

Hi Stephen! I am trying to extract the isolate genomes from PATRIC and IMG. How did you go about extracting the isolate genomes from PATRIC?

snayfach commented 4 years ago

They should be available via their FTP site: ftp://ftp.patricbrc.org/

Caiyulu-818 commented 3 years ago

Can you please provide more information about the 206,581 IGG genomes that were clustered into 23,790 representative genomes? I understand that they include samples from the HGM, PATRIC and IMG datasets. Can you tell me which genomes were downloaded from PATRIC and IMG, and when? Were all of them reconstructed from the human gut? Many thanks!

Hello, I have the same question. Could you share the IMG and PATRIC datasets with me at lucyncs123@gamil.com? Thanks a lot.

Caiyulu-818 commented 3 years ago

I've sent you a link. Please let me know if you have any trouble with it. You likely need close to 1 TB to unpack the tarball and 170 GB to download it

Great! Hi Stephen, could you share the link with lucyncs123@gmail.com? I have already sent you an email. Thanks a lot.