laurahspencer / DuMOAR

0 stars 0 forks source link

Run Diamond blastx and Megan on crab rna-seq reads #26

Closed kubu4 closed 11 months ago

kubu4 commented 1 year ago

References https://github.com/RobertsLab/resources/issues/1597

sr320 commented 1 year ago

@kubu4 what is the status of this?

kubu4 commented 1 year ago

Technically, the blasting/"MEGANizing" is finished. However, having extreme difficulties getting files opened in MEGAN GUI (required when using the free version) to extract taxonomic breakdown - the files are extremely large and MEGAN keeps crashing...

At this point, I'm mostly resigned to just looking at the smallest file, which only consists of a single set of "MEGANized" reads (R1 only, not paired), and hoping we can consider this one file a "representative" data set? I've gotten it to open, but wanted to have the corresponding R2 reads.

It takes a very long time (an hour or two) to open a file. So, when I attempt to open a file and then MEGAN crashes, it's a massive loss of time.

Anyway, I'll get something posted here sometime on Thursday.

sr320 commented 1 year ago

What is current file output (prior to GUI) - eg does it have some taxa info

On Tue, Jun 27, 2023 at 2:26 PM kubu4 @.***> wrote:

Technically, the blasting/"MEGANizing" is finished. However, having extreme difficulties getting files opened in MEGAN GUI (required when using the free version) to extract taxonomic breakdown - the files are extremely large and MEGAN keeps crashing...

At this point, I'm mostly resigned to just looking at the smallest file, which only consists of a single set of "MEGANized" reads (R1 only, not paired), and hoping we can consider this one file a "representative" data set? I've gotten it to open, but wanted to have the corresponding R2 reads.

It takes a very long time (an hour or two) to open a file. So, when I attempt to open a file and then MEGAN crashes, it's a massive loss of time.

Anyway, I'll get something posted here sometime on Thursday.

— Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/laurahspencer/DuMOAR/issues/26*issuecomment-1610242009__;Iw!!K-Hz7m0Vt54!hHNWMw7gSwsslrvdkUN1yB4_nx8HKyfzCFm7wF97HR87lbuMKOhUosdlwR8Q9Xd8A9S_A1e76iBGSbT2Hu35B7A$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/ABB4PN2SRMSDQC3W4JHACC3XNNFX5ANCNFSM6AAAAAAYUSKKDA__;!!K-Hz7m0Vt54!hHNWMw7gSwsslrvdkUN1yB4_nx8HKyfzCFm7wF97HR87lbuMKOhUosdlwR8Q9Xd8A9S_A1e76iBGSbT2VuGigsg$ . You are receiving this because you commented.Message ID: @.***>

--

Steven B. Roberts, Professor Associate Director - Graduate Program Coordinator School of Aquatic and Fishery Sciences University of Washington Fisheries Teaching and Research (FTR) Building - Office 232 1140 NE Boat Street - Seattle, WA 98105 robertslab.info - @.*** - @sr320 vm:206.866.5141 - cell:360.362.3626 schedule a zoom call: https://d.pr/gsgxVJ

kubu4 commented 1 year ago

It's not a text file - it's some sort of compressed/binary format. Heres a link to the smallest file (68GB), if you want to poke around with it:

https://gannet.fish.washington.edu/Atumefaciens/20230316-mmag-diamond-meganizer-RNAseq/CH09-13.trimmed.R1.blastx.meganized.daa

kubu4 commented 1 year ago

Okay, I've looked into this a bit more. It turns out that I can convert the DAA files to RMA6 files via command line. I bring this up because it turns out that when importing the MEGANized DAA files into the MEGAN GUI, the GUI is actually converting them to RMA6 format! Gah!

The RMA6 file format, unsurprisingly (since it just contains taxonomic assignment counts/data), is significantly smaller in size (like 2GB vs. 68GB).

I'll get something set up to run on Mox to get these all converted. Then, I should be able to easily, and quickly(!), get them imported into the MEGAN GUI to extract taxonomic counts for all samples.

kubu4 commented 1 year ago

But, to tide us over until then, here's a quick visual breakdown of taxonomic assignment for the R1 reads for

Phylogenetic tree representing read counts assigned to various taxonomies. Largest circles represent the Proteobacteria (largest) and Arhtropoda classes.

sr320 commented 1 year ago

do circles indicate prevalence? eg more Proteobacter v Arthropoda?

kubu4 commented 1 year ago

Correct. Circles represent number of reads assigned within each taxa (legend in top left of that screenshot).

kristamnichols commented 1 year ago

Very interesting! Lots of microbes in the data eating our reads? Thanks for chasing after this.On Jun 29, 2023, at 10:33 AM, kubu4 @.***> wrote: Correct. Circles represent number of reads assigned within each taxa (legend in top left of that screenshot).

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you are subscribed to this thread.Message ID: @.***>

mgavery commented 1 year ago

Wow, lots of bacteria. I'm thinking about these samples and the fact that we put the whole gill filament into the tube. ..multiple filaments even because they were so small. There is a lot of surface area on those feathery gills and filaments for bacteria to hang out on. I'm wondering: 1) can MEGAN break down species any further? maybe not get to a species level but something to compare distributions of "types" of bacteria between OA and treated crabs; 2) what's up with all the not-assigned stuff? that's a big portion of the reads too.

kubu4 commented 1 year ago

1) can MEGAN break down species any further?

Definitely. I just posted a quick screencap at the Phylum level for people to glance at. It goes down to the species level.

2) what's up with all the not-assigned stuff?

Well, my guess is that's due to limitations of the BLAST results (which is what MEGAN is relying on). Since the data is only as good as the BLAST database (and there probably isn't a massive amount of crab sequencing data in NCBI), I'm guessing there are a LOT of matches to undefined proteins (or something similar). So, there's likely not much MEGAN can do with those sequencing reads to try to perform/predict taxonomic assignments.

We could possibly try to redo the analysis with some relaxed BLAST parameters and/or MEGAN parameters to see if it helps reduced the number of unassigned reads, but it will take awhile (BLASTing/MEGANizing took ~45 days and conversion to the RMA6 file for loading into the MEGAN GUI takes ~10 days)...

Another option is to extract the reads assigned to Arthropoda (and below), assemble transcriptome and then try to align the unassigned reads to the transcriptome to identify those that are likely crab sequences. Then, we'd extract those reads and use them to create an updated transcriptome.

kristamnichols commented 1 year ago

Great idea to explore differences in microbes in treated and untreated!

On Tue, Jul 11, 2023 at 9:56 AM kubu4 @.***> wrote:

  1. can MEGAN break down species any further?

Definitely. I just posted a quick screencap at the Phylum level for people to glance at. It goes down to the species level.

  1. what's up with all the not-assigned stuff?

Well, my guess is that's due to limitations of the BLAST results (which is what MEGAN is relying on). Since the data is only as good as the BLAST database (and there probably isn't a massive amount of crab sequencing data in NCBI), I'm guessing there are a LOT of matches to undefined proteins (or something similar). So, there's likely not much MEGAN can do with those sequencing reads to try to perform/predict taxonomic assignments.

We could possibly try to redo the analysis with some relaxed BLAST parameters and/or MEGAN parameters to see if it helps reduced the number of unassigned reads, but it will take awhile (BLASTing/MEGANizing took ~45 days and conversion to the RMA6 file for loading into the MEGAN GUI takes ~10 days)...

Another option is to extract the reads assigned to Arthropoda (and below), assemble transcriptome and then try to align the unassigned reads to the transcriptome to identify those that are likely crab sequences. Then, we'd extract those reads and use them to create an updated transcriptome.

— Reply to this email directly, view it on GitHub https://github.com/laurahspencer/DuMOAR/issues/26#issuecomment-1631169859, or unsubscribe https://github.com/notifications/unsubscribe-auth/A3RZJOGIS6NBJI6TGJHJUITXPWAVDANCNFSM6AAAAAAYUSKKDA . You are receiving this because you commented.Message ID: @.***>

-- ................................................................... Krista M. Nichols, PhD Program Manager, Genetics and Evolution Conservation Biology Division Northwest Fisheries Science Center NOAA, National Marine Fisheries Service 2725 Montlake Blvd E Seattle, WA 98112 206.302.2470 (Google Voice)

kubu4 commented 1 year ago

Alrighty, I've uploaded output tables and screencaps:

https://github.com/laurahspencer/DuMOAR/tree/main/results/MEGAN

There is one table per FastQ. The format is:

Phylum Read Counts
Acidobacteria 1045
Aquificae 1253
Bacteroidetes 6709
Proteobacteria 99622
Chlamydiae 643
Planctomycetes 4057
Actinobacteria 4766
Firmicutes 1334
Microsporidia 32041
Chordata 211114
Nematoda 1718
Arthropoda 1187325
Mollusca 10101
Platyhelminthes 17161
Streptophyta 4859
Uroviricota 773
sr320 commented 1 year ago

added some data summary https://rpubs.com/sr320/1065657

laurahspencer commented 1 year ago

image

Red and blue colors on sample names indicate pH treatment (red=OA, blue=ambient)

Samples that we tossed during methylation analysis: CH05-06, CH07-24, CH07-11, CH03-04, CH03-15, CH10-08, CH09-13, CH05-01.

Neither treatment nor weird methylation data seem to line up with % RNASeq reads identified as arthropoda.

laurahspencer commented 1 year ago

No clear correlation between % reads assigned to Arthropoda (y-axis) and genome alignment rate (x-axis). NOTE: the % assigned to Arthropoda doesn't consider how many reads were not assigned.

image