patrickwest / EukRep

Classification of Eukaryotic and Prokaryotic sequences from metagenomic datasets
MIT License

Taxon identifiers for training data? #9

Open jolespin opened 4 years ago

jolespin commented 4 years ago

Do you happen to have taxon identifiers for the training data in this file?

https://genome.cshlp.org/content/suppl/2018/03/22/gr.228429.117.DC1/Supplemental_Table_S1.xlsx

Without identifiers, it's difficult to know which sequences correspond to these organisms.

Here's the list, but the names are difficult to search. Did you download them from NCBI or another database?

Organisms ``` Streptococcus pneumoniae TIGR4 Clostridium acetobutylicum ATCC 824 Staphylococcus carnosus subsp. carnosus TM300 Mycobacterium bovis BCG Pasteur 1173P2 Staphylococcus aureus subsp. aureus Mu50 Orientia tsutsugamushi str. Ikeda Colwellia psychrerythraea 34H Synechococcus sp. CC9605 Nitrosococcus oceani ATCC 19707 Anaeromyxobacter dehalogenans 2CP-C Mycobacterium vanbaalenii PYR-1 Prochlorococcus marinus str. AS9601 Prochlorococcus marinus str. MIT 9515 Shewanella loihica PV-4 Rickettsia akari str. Hartford Coxiella burnetii RSA 331 Synechococcus sp. PCC 7002 Rhodobacter sphaeroides KD131 Sulfurihydrogenibium azorense Az-Fu1 Vibrio cholerae M66-2 Desulfovibrio desulfuricans subsp. desulfuricans str. ATCC 27774 Ammonifex degensii KC4 Thermobaculum terrenum ATCC BAA-798 Bacillus megaterium QM B1551 Lactobacillus casei BL23 Streptococcus equi subsp. equi 4047 Streptococcus gallolyticus UCN34 Gluconacetobacter diazotrophicus PAl 5 Helicobacter mustelae 12198 Agrobacterium tumefaciens str. C58 Syntrophothermus lipocalidus DSM 12680 Shigella sonnei Ss046 Legionella pneumophila 2300/99 Alcoy Prevotella melaninogenica ATCC 25845 Acetohalobium arabaticum DSM 5501 Corynebacterium pseudotuberculosis 1002 Cyanothece sp. PCC 7822 Helicobacter pylori SJM180 Streptococcus parasanguinis ATCC 15912 Alicycliphilus denitrificans BC Escherichia coli O83:H1 str. NRG 857C Mycoplasma bovis PG45 clone MU clone A2 Bacillus subtilis BSn5 Thermus scotoductus SA-01 Staphylococcus pseudintermedius ED99 Acinetobacter calcoaceticus PHEA-2 Lactobacillus amylovorus strain 30SC Mycobacterium tuberculosis H37Rv Lactobacillus buchneri NRRL B-30929 Leuconostoc sp. C2 Weissella koreensis KACC 15510 Collimonas fungivorans Ter331 Streptococcus suis SS12 Salmonella enterica subsp. enterica serovar Typhimurium str. 
798 Emticicia oligotrophica DSM 17448 Aequorivita sublithincola DSM 14238 Terriglobus roseus DSM 18391 Pantoea ananatis LMG 5342 Escherichia coli ST131 Mycoplasma gallisepticum VA94_7994-1-7P Mycoplasma genitalium M2288 Thermus oshimai JL-2 Geitlerinema sp. PCC 7407 Chlamydia trachomatis L1/224 Mycoplasma pneumoniae PO1 Thermoanaerobacterium thermosaccharolyticum M0795 Candidatus Blochmannia chromaiodes str. 640 Bacillus subtilis XF-1 Sinorhizobium meliloti 2011 Raoultella ornithinolytica B6 Lactobacillus rhamnosus LOCK908 Streptococcus agalactiae 09mas018883 Streptococcus agalactiae ILRI005 Listeria monocytogenes strain N1-011A Chlamydia trachomatis RC-F(s)/342 Proteus mirabilis BB2000 Staphylococcus aureus subsp. aureus SA957 Mycoplasma parvum str. Indiana Halyomorpha halys symbiont DNA Lactococcus lactis subsp. lactis KLDS 4.0325 Burkholderia pseudomallei NCTC 13179 Rhizobium leguminosarum bv. trifolii CB782 Listeria monocytogenes WSLC1042 Salmonella enterica subsp. enterica serovar Enteritidis str. EC20120005 Salmonella enterica subsp. enterica serovar Enteritidis str. EC20110361 Salmonella enterica subsp. enterica serovar Enteritidis str. EC20110353 Azospirillum brasilense strain Az39 Brucella canis strain SVA13 Acinetobacter baumannii strain AC29 Bacillus methanolicus MGA3 Neorhizobium galegae Enterococcus faecium T110 Flavobacterium psychrophilum strain CSF259-93 Acinetobacter baumannii strain AB030 Burkholderia mallei strain FMH 23344 Burkholderia cepacia strain DDS 7H-2 Paenibacillus sp. FSL H7-0737 Burkholderia cenocepacia strain DWS 0.37 Streptococcus pyogenes strain 7F7 Bacillus cereus strain 03BB87 Hymenobacter sp. DG25B Staphylococcus aureus strain 33b Francisella guangzhouensis strain 08HL01032 Staphylococcus hyicus strain ATCC 11249 Aeromonas hydrophila J-1 Acinetobacter baumannii NCGM 237 Bacillus coagulans DSM 1 = ATCC 7050 Mycoplasma capricolum subsp. capripneumoniae 87001 Xanthomonas citri subsp. citri strain MN10 Xanthomonas citri subsp. 
citri strain AW14 Mannheimia haemolytica strain 89010807N lktA- Mycoplasma gallinaceum strain B2096 8B Oleispira antarctica strain RB-8 Xenorhabdus poinarii str. G6 Staphylococcus aureus strain FCFHV36 Thermotoga maritima strain Tma200 Streptococcus agalactiae strain SS1 Xanthomonas oryzae pv. oryzicola strain CFBP2286 Chlamydia trachomatis D/CS637/11 Planococcus sp. L10.15 Pseudomonadaceae bacterium C6819 Mycobacterium bovis BCG strain Russia 368 Streptococcus mitis strain KCOM 1350 (= ChDC B183) Bifidobacterium breve strain BR3 Synechocystis sp. PCC 6803 substrain GT-G Burkholderia cepacia ATCC 25416 chromosome 1 Campylobacter jejuni strain CJM1cam Corynebacterium pseudotuberculosis strain 1002B Rickettsia rhipicephali strain HJ#5 Bacillus amyloliquefaciens strain MBE1283 Salmonella enterica subsp. enterica serovar Enteritidis strain CMCC50041 Campylobacter jejuni strain CJ677CC532 Bacillus cereus strain FORC_013 Klebsiella oxytoca DNA complete genome strain: JKo3 Streptococcus mutans strain NG8 Achromobacter xylosoxidans strain FDAARGOS_162 Alteromonas sp. Mac1 Rhodobacter sphaeroides strain MBTLJ-8 Rickettsia prowazekii strain Naples-1 Bordetella pertussis strain E476 Bordetella pertussis strain I480 Bordetella pertussis strain I669 Mycobacterium abscessus strain FLAC003 Streptomyces ambofaciens strain DSM 40697 Azospirillum humicireducens strain SgZ-5 Rhizobium phaseoli strain R611 Lactobacillus plantarum strain NCU116 Bacillus anthracis strain Parent1 Bacillus anthracis strain PR01 Serinicoccus sp. 
JLT9 Vibrio scophthalmi strain VS-12 Enterococcus faecalis strain KB1 Candidatus Tremblaya princeps isolate TPMHIR1 Plesiomonas shigelloides strain NCTC10360 Atribacteria bacterium SCGC AAA255-E04 Aerophobetes bacterium JGI 0000014-C22 Candidate division TM6 bacterium GW2011_GWF2_30_66 UR12_C0001 Candidate division WS6 bacterium GW2011_GWC1_36_11 UR96_C0001 Berkelbacteria bacterium GW2011_GWA1_36_9 US31_C0001 Candidatus Falkowbacteria bacterium GW2011_GWC2_38_22 US83_C0001 Candidatus Curtissbacteria bacterium GW2011_GWC2_38_9 UT12_C0001 Candidate division WS6 bacterium GW2011_GWF2_39_15 UT34_C0001 Candidatus Daviesbacteria bacterium GW2011_GWA2_39_33 UT45_C0001 Candidate division CPR2 bacterium GW2011_GWD1_39_7 UT59_C0001 Candidatus Levybacteria bacterium GW2011_GWB1_41_21 UU52_C0001 Candidatus Giovannonibacteria bacterium GW2011_GWC2_44_9 UW81_C0001 Candidatus Gottesmanbacteria bacterium GW2011_GWA2_42_18 UV09_C0001 Candidatus Magasanikbacteria bacterium GW2011_GWC2_42_27 UV18_C0001 Candidate division WWE3 bacterium GW2011_GWA1_43_94 UW13_C0001 Candidatus Collierbacteria bacterium GW2011_GWA1_45_15 UW96_C0001 Candidatus Azambacteria bacterium GW2011_GWD2_46_48 UX56_C0001 Candidate division Kazan bacterium GW2011_GWC1_52_13 VE99_C0001 Candidate division WOR_3 bacterium SM1_77 WOR1_30_36_10180 Candidate division WOR-1 bacterium RIFOXYB2_FULL_36_35 Candidate division CPR3 bacterium RIFOXYB2_FULL_35_8 Candidate division WWE3 bacterium RIFCSPHIGHO2_12_FULL_38_15 Candidate division WWE3 bacterium RIFOXYB1_FULL_42_27 Candidatus Abawacabacteria bacterium RBG_16_42_10 Candidatus Amesbacteria bacterium RIFCSPLOWO2_01_FULL_48_50 Candidatus Adlerbacteria bacterium RIFCSPHIGHO2_12_FULL_53_18 Candidatus Beckwithbacteria bacterium RIFCSPHIGHO2_02_FULL_49_13 Candidatus Daviesbacteria bacterium RIFCSPLOWO2_01_FULL_39_23 Candidatus Doudnabacteria bacterium RIFCSPHIGHO2_01_FULL_46_24 Candidatus Firestonebacteria bacterium GWA2_43_8 Candidatus Glassbacteria bacterium RBG_16_58_8 
Candidatus Blackburnbacteria bacterium RIFCSPHIGHO2_02_FULL_44_20 Candidatus Chisholmbacteria bacterium RIFCSPHIGHO2_01_FULL_48_12 Candidatus Brennerbacteria bacterium RIFOXYD1_FULL_41_16 Candidatus Buchananbacteria bacterium RIFCSPHIGHO2_02_FULL_56_16 Candidatus Colwellbacteria bacterium RIFCSPHIGHO2_02_FULL_45_17 Candidatus Harrisonbacteria bacterium RIFCSPLOWO2_01_FULL_40_28 Candidatus Komeilibacteria bacterium RIFOXYD1_FULL_37_29 Candidatus Liptonbacteria bacterium RIFCSPHIGHO2_01_FULL_57_28 Candidatus Komeilibacteria bacterium RIFOXYD2_FULL_37_8 Methanococcus voltae A3 Methanosarcina mazei strain Goe1 Methanopyrus kandleri AV19 Pyrobaculum aerophilum str. IM2 Nanoarchaeum equitans Kin4-M Picrophilus torridus DSM 9790 Methanothermobacter thermautotrophicus str. Delta H Archaeoglobus fulgidus DSM 4304 Methanocella paludicola SANAE DNA Pyrococcus horikoshii OT3 DNA Aeropyrum pernix K1 DNA Methanococcus maripaludis strain S2 Methanococcoides burtonii DSM 6242 Hyperthermus butylicus DSM 5456 Thermofilum pendens Hrk 5 Methanocorpusculum labreanum Z Methanoculleus marisnigri JR1 Methanococcus maripaludis C5 Pyrobaculum arsenaticum DSM 13514 Methanobrevibacter smithii ATCC 35061 Methanococcus vannielii SB Methanococcus aeolicus Nankai-3 Methanococcus maripaludis C7 Candidatus Methanoregula boonei 6A8 Ignicoccus hospitalis KIN4/I Caldivirga maquilingensis IC-167 Thermococcus onnurineus NA1 Nitrosopumilus maritimus SCM1 Methanococcus maripaludis C6 Candidatus Korarchaeum cryptofilum OPF8 Thermoproteus neutrophilus V24Sta Desulfurococcus kamchatkensis 1221n Candidatus Methanosphaerula palustris E1-9c Halorubrum lacusprofundi ATCC 49239 Thermococcus gammatolerans EJ3 Sulfolobus islandicus L.S.2.15 Sulfolobus islandicus M.14.25 Sulfolobus islandicus Y.G.57.14 Sulfolobus islandicus Y.N.15.51 Halomicrobium mukohataei DSM 12286 Methanobrevibacter ruminantium M1 Methanocaldococcus vulcanius M7 Haloterrigena turkmenica DSM 5511 Natrialba magadii ATCC 43099 Methanohalophilus 
mahii DSM 5219 Natronomonas pharaonis DSM 2160 Methanocella arvoryzae MRE50 Halobacterium salinarum R1 Methanothermobacter marburgensis str. Marburg Methanoplanus petrolearius DSM 11571 Thermococcus barophilus MP Thermococcus sp. AM4 Ferroplasma acidarmanus fer1 Methanothermus fervidus DSM 2088 Halogeometricum borinquense DSM 11551 Methanothermococcus okinawensis IH1 Desulfurococcus mucosus DSM 2162 Sulfolobus islandicus HVE10/4 Vulcanisaeta moutnovskia 768-28 Thermoproteus uzoniensis 768-20 Archaeoglobus veneficus SNP6 Methanosarcina barkeri str. Fusaro Methanohalobium evestigatum Z-7303 Methanosaeta concilii GP-6 Metallosphaera cuprina Ar-4 Acidianus hospitalis W1 Methanobacterium paludis strain SWAN1 Halopiger xanaduensis SH-6 Thermococcus sp. 4557 Pyrolobus fumarii 1A Haloarcula hispanica ATCC 33960 Halophilic archaeon DL31 Natronobacterium gregoryi SP2 Natrinema pellirubrum DSM 15624 Halobacterium sp. DL1 Pyrobaculum ferrireducens strain 1860 Methanosaeta harundinacea 6Ac Haloquadratum walsbyi C23 Thermogladius cellulolyticus 1633 Pyrococcus furiosus COM1 Natrinema sp. J7-2 Candidatus Nitrosopumilus koreensis AR1 Candidatus Nitrosopumilus sp. AR2 Candidatus Methanomethylophilus alvus Mx1201 Methanoculleus bourgensis MS2T Methanolobus psychrophilus R15 Haloferax mediterranei ATCC 33500 Caldisphaera lagunensis DSM 15908 Methanoregula formicicum SMSP Aciduliprofundum sp. MAR08-339 Halovivax ruber XH-70 Methanomethylovorans hollandica DSM 15978 Natronococcus occultus SP4 Sulfolobus acidocaldarius N8 Methanosarcina mazei Tuc01 Thermoplasmatales archaeon BRNA1 Archaeoglobus sulfaticallidus PM70-1 Salinarchaeum sp. Harcht-Bsk1 Candidatus Methanomassiliicoccus intestinalis Issoire-Mx1 Halorhabdus tiamatea SARL4B Haloarcula hispanica N601 Sulfolobus acidocaldarius SUSAZ Thermococcus sp. 
ES1 Thermococcus nautili strain 30-1 Aeropyrum camini SY1 = JCM 12091 Natronomonas moolapensis 8.8.11 Nitrososphaera viennensis EN76 Candidatus Nitrososphaera evergladensis SR1 Archaeoglobus fulgidus DSM 8774 Methanocaldococcus bathoardescens strain JH146 Methanobacterium formicicum strain BRM9 Thermococcus eurythermalis strain A501 Geoglobus acetivorans strain SBH6 Candidatus Methanoplasma termitum strain MpT1 Candidatus Nitrosopelagicus brevis strain CN25 Thermofilum carboxyditrophus 1505 Thermococcus guaymasensis DSM 11113 Haloarcula sp. CBA1115 Archaeon GW2011_AR10 Methanobacterium formicicum genome assembly DSM1535 Sulfolobus solfataricus strain SULB Sulfolobus solfataricus strain SULC Sulfolobus solfataricus strain SULA Methanosarcina sp. WWM596 Methanosarcina barkeri str. Wiesmoor Methanosarcina siciliae T4/M Methanosarcina siciliae HI350 Methanosarcina mazei WWM610 Methanosarcina mazei LYC Methanosarcina mazei C16 Methanosarcina lacustris Z-7289 Methanosarcina horonobensis HB-1 Methanosarcina barkeri 3 Methanococcoides methylutens MM1 Thermofilum sp. 1807-2 Geoglobus ahangari strain 234 Halanaeroarchaeum sulfurireducens strain HSR2 Pyrobaculum sp. WP30 Haloferax gibbonsii strain ARA6 Metallosphaera sedula strain ARS50-1 Metallosphaera sedula strain ARS120-2 Metallosphaera sedula strain SARC-M1 Halanaeroarchaeum sulfurireducens strain M27-SA2 Pyrodictium delaneyi strain Su06 Thermococcus barophilus strain CH5 Methanobrevibacter millerae strain SM9 Ignicoccus islandicus DSM 13165 Thermococcus sp. 2319x1 Halobacterium hubeiense genome assembly Halobacterium hubeiense JI20-1 Nanoarchaeota archaeon 7A Methanogenic archaeon ISO4-H5 Methanobrevibacter olleyae strain YLM1 Pyrococcus sp. NCB100 Thermococcus sp. CDGS Methanoculleus sp. 
MAB1 isolate Methanoculleus sp MAB1 Sulfolobus solfataricus strain P1 Aigarchaeota archaeon SCGC AAA471-E14 Aigarchaeota archaeon SCGC AAA471-B22 Aigarchaeota archaeon JGI 0000001-A7 Aigarchaeota archaeon JGI 0000106-J15 Aigarchaeota archaeon SCGC AAA471-E14 Aigarchaeota archaeon SCGC AAA471-E14 Aigarchaeota archaeon SCGC AAA471-F17 Candidatus Thorarchaeota archaeon SMTZ1-83 Crenarchaeota archaeon SCGC AAA471-B05 Crenarchaeota archaeon SCGC AAA471-L14 Thermoplasmatales archaeon DG-70 15865 Acidilobus saccharovorans 345-15 Caldisphaera lagunensis DSM 15908 Desulfurococcus fermentans DSM 16532 Ignisphaera aggregans DSM 17230 Staphylothermus hellenicus DSM 12710 Thermogladius cellulolyticus 1633 Thermosphaera aggregans DSM 11486 Hyperthermus butylicus DSM 5456 Pyrolobus fumarii 1A Fervidicoccus fontis Kam940 Acidianus hospitalis W1 Metallosphaera sedula DSM 5348 Sulfolobus solfataricus P2 Sulfolobales archaeon Acd1 Sulfolobales archaeon AZ1 isolate Thermofilum pendens Hrk 5456 Caldivirga maquilingensis IC-167 Thermoproteus neutrophilus V24Sta Thermoproteus uzoniensis 768-20 Hadesarchaea archaeon YNP_45 Hadesarchaea archaeon YNP_N21 Candidatus Korarchaeum cryptofilum OPF8 Nanoarchaeota archaeon SCGC AAA011-G17 Nanoarchaeota archaeon SCGC AAA011-L15 Nanoarchaeum equitans Kin4-M Candidatus Haloredivivus sp. G17 Candidatus Nanosalinarum sp. 
J07AB56 Candidatus Micrarchaeum acidiphilum ARMAN-2 Candidatus Parvarchaeum acidophilus ARMAN-5_'5-way FS' Cenarchaeum symbiosum A Thaumarchaeota archaeon CSP1-1 Thaumarchaeota archaeon SCGC AB-539-E09 Candidatus Nitrosopumilus koreensis AR1 Candidatus Nitrosoarchaeum koreensis MY1 MY1 Nitrososphaera viennensis EN76 Thaumarchaeota archaeon RBG_16_49_8 Thaumarchaeota archaeon MY2 NKMY2_1 Thaumarchaeota archaeon SCGC AAA282-K18 Thaumarchaeota archaeon SCGC AB-179-E04 Candidatus Caldiarchaeum subterraneum DNA Marine Group I thaumarchaeote SCGC AB-629-I23 Marine Group III euryarchaeote SCGC AAA288-E19 Thaumarchaeota archaeon SCGC AAA007-O23 Marine Group I thaumarchaeote SCGC AAA799-E16 Marine Group I thaumarchaeote SCGC AAA799-N04 Candidatus Micrarchaeota archaeon RBG_16_49_10 Candidate division WOR_3 bacterium SM1_77 Candidatus Micrarchaeota archaeon RBG_16_36_9 Candidatus Nanopusillus sp. Nst1 halophilic archaeon J07HX64 Vulcanisaeta distributa DSM 14429 Plasmodium vivax Trypanosoma brucei Leishmania major strain Friedlin Plasmodium falciparum strain 3D7 Eimeria tenella Leishmania braziliensis MHOM/BR/75/M2904 Leishmania infantum JPCM5 Theileria annulata strain Ankara Dictyostelium discoideum Plasmodium knowlesi strain H Toxoplasma gondii ME49 chromosome Ia Thalassiosira pseudonana CCMP1335 Phaeodactylum tricornutum CCAP 1055/1 Cryptosporidium parvum Iowa II Theileria parva strain Muguga Neospora caninum Liverpool Cryptosporidium parvum Trypanosoma brucei gambiense DAL972 Leishmania donovani BPK282A1 Leishmania mexicana MHOM/GT/2001/U1103 Ectocarpus siliculosus strain Ec 32 Plasmodium cynomolgi strain B Crithidia fasciculata strain Cf-Cl Babesia equi strain WA Leishmania sp. 
MAR LEM2494 Leishmania donovani strain BHU 1220 Nannochloropsis gaditana strain B-31 Babesia microti strain RI Plasmodium coatneyi strain Hackeri Theileria orientalis strain Shintoku Leishmania panamensis strain MHOM/PA/94/PSC-1 Babesia bigemina genome assembly Bbig001 Leishmania peruviana PAB-4377_V1 Plasmodium reichenowi strain SY57 Plasmodium gaboni strain SY75 Plasmodium berghei Plasmodium yoelii Laurentiella Helicosporidium sp. ATCC 50920 Coccomyxa sp. LA000219 Trebouxia gelatinosa isolate LA000220 Chlorella pyrenoidosa strain FACHB-9 Chlamydomonas applanata Yarrowia lipolytica CLIB122 Schizosaccharomyces pombe Zygosaccharomyces rouxii strain CBS732 Candida dubliniensis CD36 Encephalitozoon intestinalis Saccharomyces kluyveri NRRL Y-12651 Aspergillus oryzae RIB40 Mycosphaerella graminicola IPO323 Myceliophthora thermophila ATCC 42464 Thielavia terrestris NRRL 8126 Eremothecium cymbalariae Fusarium graminearum PH-1 Encephalitozoon hellem Kazachstania africana CBS 2517 Saccharomyces cerevisiae R103 Valsa mali strain 03-8 Saccharomyces cerevisiae YJM1443 Saccharomyces cerevisiae YJM1447 Saccharomyces cerevisiae YJM1477 Saccharomyces cerevisiae YJM1479 Saccharomyces cerevisiae YJM1615 Sporisorium scitamineum strain SSC39 Torulaspora delbrueckii strain NRRL Y-50541 Eremothecium sinecaudum strain ATCC 58844 Kluyveromyces marxianus isolate B0399 Oikopleura dioica Lottia gigantea ```
patrickwest commented 4 years ago

Unfortunately, I do not have taxonomic identifiers for them, i.e., NCBI taxonomy IDs. The genomes were collected mostly from NCBI and JGI's MycoCosm, but a few came from JGI's Genome Portal as well.

I do have the sequences compiled still and could send you a copy if you're interested. Please send me an email if so.
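For anyone who wants to recover taxonomy IDs from the organism names themselves, here is a rough sketch (not something EukRep provides) that does exact-name lookups against a local copy of `names.dmp` from the NCBI taxonomy dump. The function names `load_name_to_taxids` and `lookup_taxids` are hypothetical, and the file path is whatever you extracted the dump to. Expect misses: many entries in the list carry extra words such as "DNA", "chromosome 1", or "genome assembly" that will not match an NCBI name exactly, and the JGI-only genomes may have no NCBI entry at all.

```python
from collections import defaultdict


def load_name_to_taxids(names_dmp_path):
    """Map lowercased names from an NCBI names.dmp file to sets of taxonomy IDs."""
    name_to_taxids = defaultdict(set)
    with open(names_dmp_path, encoding="utf-8") as handle:
        for line in handle:
            # Rows look like: tax_id \t|\t name_txt \t|\t unique name \t|\t name class \t|
            row = line.rstrip("\n")
            if row.endswith("\t|"):
                row = row[:-2]
            fields = row.split("\t|\t")
            if len(fields) < 2:
                continue  # skip malformed rows
            taxid, name = fields[0].strip(), fields[1].strip()
            name_to_taxids[name.lower()].add(int(taxid))
    return name_to_taxids


def lookup_taxids(organism_names, name_to_taxids):
    """Return {original name: sorted list of matching taxonomy IDs (possibly empty)}."""
    return {
        name: sorted(name_to_taxids.get(name.strip().lower(), ()))
        for name in organism_names
    }
```

Calling `lookup_taxids(names, load_name_to_taxids("names.dmp"))` returns an empty list for every name without an exact match, so those can be collected and checked by hand.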

jolespin commented 4 years ago

Yeah, I had a feeling they would contain a mix of NCBI and JGI. Definitely interested in playing around with the fasta database if it's available. How big is the (compressed?) fasta file database? Thanks Patrick!

jolespin commented 2 years ago

Are you still able to transfer over the fasta files?

patrickwest commented 2 years ago

Sorry, this slipped my mind. The compressed archive is 1.3 GB. See if you can get it from here: https://figshare.com/articles/dataset/eukrep_training_genomes_tar_gz/15145041
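
Once the archive is downloaded, it may be worth checking what is inside before unpacking all of it. The sketch below uses Python's standard `tarfile` module to list the regular files and their sizes. The helper name `list_archive_members` is hypothetical, and the local filename in the usage note is an assumption based on the figshare title, not something confirmed in this thread.

```python
import tarfile


def list_archive_members(archive_path, limit=None):
    """Return (name, size_in_bytes) tuples for regular files in a .tar.gz archive."""
    members = []
    with tarfile.open(archive_path, mode="r:gz") as archive:
        # Iterating the archive reads member headers one at a time without extracting.
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other non-file entries
            members.append((member.name, member.size))
            if limit is not None and len(members) >= limit:
                break
    return members
```

For example, `list_archive_members("eukrep_training_genomes.tar.gz", limit=10)` would show the first ten files, which should make it clear how the genomes are named and whether the names line up with the organism list above.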