Open ctb opened 1 year ago
Note: I don't think this is supposed to be a metagenome, this seems to be a genome assembly project!
`gather` won't capture the actual organism because it is a euk and there are no euks in the rs207 reference database I used for gather.
But the reads certainly seem to have microbial contamination going on =]
`wort` computes signatures for euks as long as they are not animal or plant; since this is an alga, its signature was calculated too.
and also, from hugo: that big file alone had a total of 23,000 bacterial genomes in the gather file, plus 12,300 archaea, protozoa, fungi, and viruses. Total of > 35,000.
per luiz:
we have a new champion! https://trace.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&acc=SRR21113412&display=metadata is the largest signature in wort, 4.1GB :joy:
this fellow: https://www.ncbi.nlm.nih.gov/Taxonomy/taxi/images/15060
:joy: :sob:
took 25 minutes to run this sig describe :joy:
muuuuuch more manageable:
(26m for filter, 42s for describe)
it's cool how 51- and 31-mer cardinality increases for unfiltered, but decreases for filtered (when compared to 21-mers)
no guarantees on the quality of results, but mastiff gather took 19m54s on the filtered one, and is running for 2h+ on the original one :joy:
wondering how well isolated the genome was before sequencing :upside_down_face:
❯ head SRR21113412.csv
GCF_002222635.1 Sulfitobacter pseudonitzschiae strain=SMR1, ASM222263v1 4648000 0.9335207873066881
GCF_005144905.1 Vibrio cyclitrophicus strain=ECSMB14105, ASM514490v1 4465000 0.9093686354378818
GCA_001562115.1 Alteromonas stellipolaris strain=LMG 21861, ASM156211v1 3835000 0.8013383521539105
GCA_007988745.1 Pseudoalteromonas atlantica strain=NBRC 103033, ASM798874v1 3800000 0.8508535489667565
GCF_002115725.1 Marivita cryptomonadis strain=CL-SK44, ASM211572v1 3349000 0.719412019022914
GCF_000733925.1 Arenibacter algicola strain=TG409, ASM73392v1 3287000 0.6098330241187384
GCA_000831005.1 Marinobacter salarius strain=R9SW1, ASM83100v1 2782000 0.6007785467128027
GCF_002890895.1 Pseudomonas stutzeri strain=4C29, ASM289089v1 2653000 0.5937288517933679
GCF_000014745.1 Maricaulis maris MCS10 strain=MCS10, ASM1474v1 2515000 0.7632776934749621
GCF_001447995.1 Maribacter dokdonensis DSW-8 strain=DSW-8, DSW8_denovo_v1 2388000 0.5383397421397874
same index, but with -s 10000 takes 2m53s to run:
❯ head SRR21113412.csv
GCF_002222635.1 Sulfitobacter pseudonitzschiae strain=SMR1, ASM222263v1 4570000 0.9364754098360656
GCF_005144905.1 Vibrio cyclitrophicus strain=ECSMB14105, ASM514490v1 4160000 0.8813559322033898
GCA_001562115.1 Alteromonas stellipolaris strain=LMG 21861, ASM156211v1 3930000 0.7875751503006012
GCA_007988745.1 Pseudoalteromonas atlantica strain=NBRC 103033, ASM798874v1 3880000 0.8308351177730193
GCF_002115725.1 Marivita cryptomonadis strain=CL-SK44, ASM211572v1 3400000 0.7100840336134454
GCF_000733925.1 Arenibacter algicola strain=TG409, ASM73392v1 3230000 0.6003717472118959
GCA_000831005.1 Marinobacter salarius strain=R9SW1, ASM83100v1 2850000 0.6319290465631929
GCF_002890895.1 Pseudomonas stutzeri strain=4C29, ASM289089v1 2790000 0.610989010989011
GCF_000014745.1 Maricaulis maris MCS10 strain=MCS10, ASM1474v1 2510000 0.7652439024390244
GCA_009649675.1 Alphaproteobacteria bacterium HT1-32 strain=HT1-32, ASM964967v1 2480000 0.5210970464135021
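A quick way to see how much the `-s 10000` run changes the picture is to compare the two `head` outputs above directly. This is only a sketch: the rows are copied from the thread, but the column meanings (accession, an overlap estimate, a fraction) are an assumption since the CSV has no header here, and the names `run_first`, `run_s10000`, and `compare_runs` are made up for illustration.

```python
# Rows copied from the two `head SRR21113412.csv` outputs in this thread.
# ASSUMPTION: columns are (accession, overlap estimate, fraction); the
# CSV header is not shown in the thread, so these labels are a guess.
run_first = [
    ("GCF_002222635.1", 4648000, 0.9335207873066881),
    ("GCF_005144905.1", 4465000, 0.9093686354378818),
    ("GCA_001562115.1", 3835000, 0.8013383521539105),
    ("GCA_007988745.1", 3800000, 0.8508535489667565),
    ("GCF_002115725.1", 3349000, 0.719412019022914),
    ("GCF_000733925.1", 3287000, 0.6098330241187384),
    ("GCA_000831005.1", 2782000, 0.6007785467128027),
    ("GCF_002890895.1", 2653000, 0.5937288517933679),
    ("GCF_000014745.1", 2515000, 0.7632776934749621),
    ("GCF_001447995.1", 2388000, 0.5383397421397874),
]
run_s10000 = [
    ("GCF_002222635.1", 4570000, 0.9364754098360656),
    ("GCF_005144905.1", 4160000, 0.8813559322033898),
    ("GCA_001562115.1", 3930000, 0.7875751503006012),
    ("GCA_007988745.1", 3880000, 0.8308351177730193),
    ("GCF_002115725.1", 3400000, 0.7100840336134454),
    ("GCF_000733925.1", 3230000, 0.6003717472118959),
    ("GCA_000831005.1", 2850000, 0.6319290465631929),
    ("GCF_002890895.1", 2790000, 0.610989010989011),
    ("GCF_000014745.1", 2510000, 0.7652439024390244),
    ("GCA_009649675.1", 2480000, 0.5210970464135021),
]


def compare_runs(a, b):
    """Return (shared accessions, same relative order?, max relative
    difference in the second column over the shared accessions)."""
    in_a = {row[0] for row in a}
    in_b = {row[0] for row in b}
    shared_a = [row[0] for row in a if row[0] in in_b]
    shared_b = [row[0] for row in b if row[0] in in_a]
    b_by_acc = {row[0]: row for row in b}
    max_rel_diff = max(
        abs(row[1] - b_by_acc[row[0]][1]) / row[1]
        for row in a
        if row[0] in b_by_acc
    )
    return shared_a, shared_a == shared_b, max_rel_diff


shared, same_order, max_rel_diff = compare_runs(run_first, run_s10000)
print(f"shared in top 10: {len(shared)}")
print(f"same relative order: {same_order}")
print(f"largest relative change in overlap column: {max_rel_diff:.1%}")
```

On these ten rows, nine accessions appear in both lists in the same order, and the largest change in the overlap column is just under 7% (the Vibrio cyclitrophicus row), so the coarser run mostly agrees on the top hits.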
food for thought: can't do this easily without redownloading all the metagenomes and sketching with scaled=100, but an SRA metagenome index would be ~10TB (which fits on HDDs/SSDs) and would potentially allow viral queries?
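Rough arithmetic on why a smaller scaled value matters for viruses: a scaled sketch keeps on average about 1/scaled of the distinct k-mers, so a small genome leaves very few hashes at scaled=1000. The genome sizes below are hypothetical round numbers, not measurements, and `expected_hashes` is an illustrative helper, not a sourmash function. The ~10TB figure above is not derived here.

```python
def expected_hashes(distinct_kmers: int, scaled: int) -> float:
    """Expected sketch size when roughly 1/scaled of distinct k-mers are kept."""
    return distinct_kmers / scaled


# ASSUMPTION: round-number genome sizes, treating distinct k-mers ~ genome length.
examples = {
    "small virus (~10 kb)": 10_000,
    "large phage (~100 kb)": 100_000,
    "bacterium (~5 Mb)": 5_000_000,
}

for name, n_kmers in examples.items():
    for scaled in (1000, 100):
        n = expected_hashes(n_kmers, scaled)
        print(f"{name}: scaled={scaled} -> ~{n:.0f} hashes")
```

With only around ten hashes per 10 kb genome at scaled=1000, overlap estimates get very noisy; scaled=100 gives ten times as many hashes, at roughly ten times the sketch size.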
unfiltered/s1000 finished after 3h34m, but I didn't save the output properly and can't compare :joy:
unfiltered/s10000 is running now, should finish soon
unfiltered/s10000 took 37m29s, top results: