shandley / hecatomb

hecatomb is a virome analysis pipeline for Illumina sequence data
MIT License

include info about potential false positives in README? #11

Closed (mihinduk closed 3 years ago)

mihinduk commented 4 years ago

Should we work on a list of potential false positives, with reasoning, for the paper? I think we should include Ebrahim and Rob's work on the Poxviridae and the LINEs and SINEs. We could also include a warning about the need for follow-up on the large dsDNA viruses (CRISPR-Cas related function in mimiviruses; transposons)?

shandley commented 4 years ago

This is a great idea. We can provide a 'provisional' list now. Some analysis of the false-positive Pox sequences could be done, and based on what Ebrahim and Rob did, there is a path forward (examining Alu repeats).

The large dsDNA viruses need more detailed examination, though. It is unclear how we could show that those hits are always false positives, other than that they only align with very low quality. However, those sequences could potentially (although it is unlikely) be from novel large dsDNA viruses.

This is a great small research project. Let's discuss who has time to approach this challenging but important problem in our next meeting.

mihinduk commented 4 years ago

I also wonder if adding a DUST step for low-complexity sequences would help.
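
For reference, here is a minimal sketch of what a DUST-style check computes. It counts overlapping triplets in a sequence and scores how often they repeat, so a higher score means lower complexity. This is an illustration only, not code from hecatomb or from any existing DUST implementation; the function names and the threshold of 2.0 are hypothetical and would need tuning on real data.

```python
from collections import Counter


def dust_score(seq: str) -> float:
    """Return a DUST-style triplet score; higher means lower complexity."""
    seq = seq.upper()
    # All overlapping 3-mers in the sequence.
    triplets = [seq[i:i + 3] for i in range(len(seq) - 2)]
    n = len(triplets)
    if n < 2:
        # Too short to score meaningfully.
        return 0.0
    counts = Counter(triplets)
    # Each triplet seen c times contributes c * (c - 1) / 2 repeat pairs.
    repeat_pairs = sum(c * (c - 1) / 2 for c in counts.values())
    return repeat_pairs / (n - 1)


def is_low_complexity(seq: str, threshold: float = 2.0) -> bool:
    """Flag a sequence whose score exceeds a (hypothetical) threshold."""
    return dust_score(seq) > threshold
```

A homopolymer run such as `"A" * 30` scores far above a mixed sequence, so a filter of this kind would flag poly-A tails and simple repeats before alignment.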

shandley commented 4 years ago

Yes, and this was a big part of VirusSeeker that is currently (and perhaps naively) ignored in hecatomb. I think it was actually valuable to examine the data without a DUST filter. It is leading to insights about our data (such as the poxvirus false positives), so there is an argument for leaving it turned off. There are also already some low-complexity masking steps in both the database generation and sequence processing of hecatomb (I can't recall at which steps, but BBTools does do some low-entropy masking).

This requires a bit more investigation and thought.
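
To make the low-entropy masking idea concrete, here is a rough sketch of the general technique: slide a window along the sequence, compute the Shannon entropy of the k-mers in each window, and replace low-entropy windows with `N`. This is not the BBTools implementation and not what hecatomb currently runs; the function names, window size, k, and cutoff are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def kmer_entropy(window: str, k: int = 3) -> float:
    """Shannon entropy (in bits) of the k-mer distribution in a window."""
    kmers = [window[i:i + k] for i in range(len(window) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_entropy(seq: str, window: int = 20, k: int = 3,
                     min_entropy: float = 1.5) -> str:
    """Replace every window whose k-mer entropy is below min_entropy with N."""
    seq = seq.upper()
    masked = list(seq)
    # Sequences shorter than the window are scored as a single window.
    for start in range(max(len(seq) - window + 1, 1)):
        chunk = seq[start:start + window]
        if kmer_entropy(chunk, k) < min_entropy:
            for i in range(start, start + len(chunk)):
                masked[i] = "N"
    return "".join(masked)
```

Masking rather than discarding keeps the read in the data set, which fits the argument above for still being able to examine low-complexity hits instead of silently dropping them.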