Closed by mihinduk 3 years ago
This is a great idea. We can provide a 'provisional' list now. Some analysis of the false-positive poxvirus sequences could be done, and based on what Ebrahim and Rob did, there is a path forward (examining Alu repeats).
The large dsDNA viruses need more detailed examination, though. It is unclear how we could show that those are always false positives, other than that they only align with very low quality. However, those sequences could potentially (although it is unlikely) be from novel large dsDNA viruses.
This is a great small research project. Let's discuss at our next meeting who has time to take on this challenging but important problem.
I also wonder if adding a DUST step for low complexity sequences would help.
Yes, and this was a big part of VirusSeeker that is currently (and perhaps naively) ignored in Hecatomb. I think it was actually valuable to examine the data without a DUST filter. It is leading to insights about our data (such as the poxvirus hits), so there is an argument for leaving it turned off. There are also already some low-complexity masking steps in both the db generation and the sequence processing of Hecatomb (I can't recall at which steps, but BBTools does do some low-entropy masking).
This requires a bit more investigation and thought.
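To make the DUST idea above a bit more concrete, here is a minimal Python sketch of a DUST-style low-complexity score computed over a whole read. It is only an illustration: the function names and the `threshold` value are hypothetical, it scores the full sequence rather than sliding windows, and it is not the code used by VirusSeeker, BBTools, or Hecatomb.

```python
from collections import Counter


def dust_score(seq: str) -> float:
    """Simplified DUST-style score for a whole sequence.

    Counts every overlapping triplet, sums c * (c - 1) / 2 over the
    triplet counts, and divides by (number of triplets - 1).
    Higher scores mean more repetitive (lower complexity) sequence.
    """
    seq = seq.upper()
    triplets = [seq[i:i + 3] for i in range(len(seq) - 2)]
    if len(triplets) < 2:
        # Too short to score; treat as not repetitive.
        return 0.0
    counts = Counter(triplets)
    repeats = sum(c * (c - 1) / 2 for c in counts.values())
    return repeats / (len(triplets) - 1)


def is_low_complexity(seq: str, threshold: float = 2.0) -> bool:
    """Flag a sequence whose score exceeds a cutoff.

    The default threshold is a hypothetical value for illustration,
    not a setting taken from any existing tool.
    """
    return dust_score(seq) > threshold


if __name__ == "__main__":
    for read in ("A" * 20, "ACGTTGCAAGCTAGGCTATC"):
        print(read, round(dust_score(read), 3), is_low_complexity(read))
```

Reporting the score as a flag instead of dropping reads would let us keep examining the unfiltered data while still seeing which hits a DUST step would have removed.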
Should we work on a list of potential false positives, with reasoning, for the paper? I think we should include Ebrahim and Rob's work on the Poxviridae and the LINEs and SINEs. We could include a warning about the need for follow-up with the large dsDNA viruses (CRISPR-Cas-related function in mimiviruses; transposons)?