rhysnewell / aviary

A hybrid assembly and MAG recovery pipeline (and more!)
GNU General Public License v3.0

Skip QC produces concatenated (not interleaved) short-reads #166

Closed AroneyS closed 10 months ago

AroneyS commented 11 months ago

Short forward and reverse reads are concatenated one after the other instead of being interleaved (which metaspades and megahit expect): https://github.com/rhysnewell/aviary/blob/5915015b114ef7284a3e50d559e204b4b6c5428d/aviary/modules/quality_control/scripts/qc_short_reads.py#L89

Please tell me I'm wrong...

rhysnewell commented 11 months ago

Only if you are skipping QC with fastp AND running coassembly does the concatenation happen. I also can't find anything in the spades docs that says this would cause issues. It should be fine if it can find the mate of each read somewhere in the file, but I am likely mistaken.

AroneyS commented 11 months ago

Does spades not just assume interleaved reads and go with it?

AroneyS commented 11 months ago

What about megahit?

rhysnewell commented 11 months ago

Better to just change it to produce interleaved reads if it is a concern. The safest option would be to not allow skipping QC, IMO.
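
For illustration, producing interleaved reads from a forward and a reverse file could look roughly like the sketch below. This is not the fix that was pushed to Aviary; the function names (`open_fastq`, `read_records`, `interleave_fastq`) are hypothetical, and the sketch assumes strict 4-line FASTQ records with both files listing mates in the same order.

```python
import gzip
from itertools import zip_longest


def open_fastq(path, mode="rt"):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, mode) if str(path).endswith(".gz") else open(path, mode)


def read_records(handle):
    """Yield FASTQ records as lists of four newline-terminated lines.

    Assumption: every record is exactly four lines (no wrapped sequences).
    """
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0].strip():
            return  # end of file (or a trailing blank line)
        if not record[3]:
            raise ValueError("truncated FASTQ record")
        # Normalise line endings so a missing final newline cannot fuse records.
        yield [line.rstrip("\n") + "\n" for line in record]


def interleave_fastq(forward_path, reverse_path, out_path):
    """Write forward and reverse records alternately; return the pair count."""
    pairs = 0
    with open_fastq(forward_path) as fwd, open_fastq(reverse_path) as rev, \
            open_fastq(out_path, "wt") as out:
        for r1, r2 in zip_longest(read_records(fwd), read_records(rev)):
            if r1 is None or r2 is None:
                raise ValueError("forward and reverse files differ in length")
            out.writelines(r1)  # forward mate first
            out.writelines(r2)  # reverse mate directly after it
            pairs += 1
    return pairs
```

Calling `interleave_fastq("reads_1.fastq", "reads_2.fastq", "interleaved.fastq")` (placeholder file names) would write each forward read immediately followed by its mate, which is the layout an interleaved-input argument assumes.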

AroneyS commented 11 months ago

Yep, definitely a problem. Spades just assumes interleaved (since that is what the argument says).

e.g. (the reverse reads are named @SRR8943084.1 1/2)

# assemble/data/short_read_assembly/split_input/short_reads_1.fastq
@SRR8799000.1 1/1
TTCGCGAATATGTCTAAACGCATGGGAGAGATGGTTAGGGAAGAATTAGAATTACTGGGTCCTAAGCCATTGGCTGAAGTAGAGACAGCACAGAAAGAAATAGTTGATAGTCTTGTCAAACTGGAGGCTCAAGGAGAAACAATAAGGGGA
+
DDDDDIIIIIIIIIIHIIIIIIHGIHIIIIHIHIHHIIIIIIIHHHIHHIIIIIIIIIIIIIIIIIIIHIIIIIIHIIIIIIIIIIIIIHIIGHHHIIIIIHHHHHFIHIIIIIIIIIIIIIIGHIIIHIIIIHIIIIIIIIIIIIIHHH
@SRR8799000.3 3/1
CTAAACATGGGTGGTATAATGGAATCAAACACATTTACAAAGATATACCTCGCCATTTTTGGGCAATTTGATTGGCAGGGGATACCGGCACTACCAATAGAAGTAATTCTTCTGCCTAATTCGTTTTACTTTAACATTTATGAGTTTTCT
+
DDDDDIIIIHHHHIHHIIIHHHIIIIIIIIEHHHHHHIIIIGIIIHIIEHH=DHIIHHCH/CCHIIIHGIHIIG@EHHFDHDEHIHHDHHHHIHHHIIIIIGHHHHIHHEEHHIGGHHHGIH?HEHHHIIHIIHIHHHFHICHHFH@GHE
@SRR8799000.5 5/1
GTACTCCTGCAGCAGCTGCGCGAGGTGGGCCTGCCGCTCCTGGGAGGTCTTCCAGCGGCCCAGGGCGGGAGTGAGCTGGCGGATCTTCTCGATGTCCATGCGAGCCTGGCCGAGCTGGGCGGCGAAGCCCTTCACCGACAGCTGCGCCTC

# assemble/data/short_read_assembly/split_input/short_reads_2.fastq
@SRR8799000.2 2/1
TCCACTTGATTTTTTCCATCGCATTATCTAAAAATTTTTGTTCAATGTCACGGAGTACAACATTGTATCCTGCCATGGCAGAAACCTGAGCAATACCATGTCCCATAATACCAGAACCTAAAACTA
+
DDDDDHIHIIIIIGHIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIGHIIIIHGHHIIHIIEHHIHIIIIIIIIIIHHIIIIIIHIIGHGIHEHHHHHHIHIIIIIIGHIGHHIHIHH?FFCH
@SRR8799000.4 4/1
AAAGATACAGGTATCTCACTGAAAGTATTGAGAGTAGATGGAGGGGCAACTGCAAACAATTTCTTATGCCAATTCCAATCCGACATTCTTAACCTGGCTGTAAGCCGTCCAAAAANT
+
DDBDDIHIIIIEHHIIIIIFIFHIHGHHHHIHEHFHHHCHHIIIIHIGIGFHHGHHHGIIIIGHIIIIIIIIIIIIIIIFHIIHIIFEEHIIFHHHF@EHHGHHIIIIIIIIHGH#<
@SRR8799000.6 6/1
TATTGACAAACCAGAATTTTGTTTTGGTGCCCATTTGTTCAATCTGCAACTTCTCGTTTACGTTAAGGTTATATACTGGGTAAGATTGACACATTATGGACAAGCTTTCTCGATTCGGCTAATATTCATCTTATATATTACGATAAATGG
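
The split files above contain only forward reads, which is what you would expect if a concatenated file were split as though it were interleaved. A quick sanity check for this before assembly could look like the sketch below. The helper names (`record_names`, `looks_interleaved`) are hypothetical, and it assumes 4-line records whose mates share the first header token (e.g. `@SRR8799000.1 1/1` followed by `@SRR8799000.1 1/2`), optionally with a `/1` or `/2` suffix on that token.

```python
from itertools import islice


def record_names(lines):
    """Yield the read name from the header of each 4-line FASTQ record."""
    for index, line in enumerate(lines):
        if index % 4 != 0 or not line.strip():
            continue
        token = line.split()[0]
        if not token.startswith("@"):
            raise ValueError(f"unexpected FASTQ header: {line!r}")
        name = token[1:]
        # Drop an old-style mate suffix so both mates compare equal.
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        yield name


def looks_interleaved(lines, max_pairs=1000):
    """Return True if the first records come in consecutive same-name pairs."""
    names = list(islice(record_names(lines), 2 * max_pairs))
    if len(names) < 2 or len(names) % 2:
        return False
    return all(a == b for a, b in zip(names[0::2], names[1::2]))


# Headers in the style shown in this thread; sequences are shortened dummies.
concatenated = ["@SRR8799000.1 1/1", "ACGT", "+", "IIII",
                "@SRR8799000.2 2/1", "TTGA", "+", "IIII"]
interleaved = ["@SRR8799000.1 1/1", "ACGT", "+", "IIII",
               "@SRR8799000.1 1/2", "TTGA", "+", "IIII"]
print(looks_interleaved(concatenated), looks_interleaved(interleaved))
```

This only inspects the first `max_pairs` pairs, so it is a cheap heuristic, not a full validation of the file.
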

AroneyS commented 11 months ago

I need to do QC outside of Aviary to also do unmapping. Why does skip qc change the files anyway? Can't you just pass them through?

AroneyS commented 11 months ago

> Only if you are skipping QC with fastp AND running coassembly does the concatenation happen.

Skip QC is sufficient. If you skip QC, the reads are sequentially concatenated in lines 89/90; if coassembly, in lines 64/75. Neither does interleaving.

AroneyS commented 11 months ago

Dang, I was hoping you were right. That's 6 weeks down the drain.

rhysnewell commented 11 months ago

Bruh, I'm sorry. Hang on, I've got a fix I'll push up to another branch