Closed GoogleCodeExporter closed 8 years ago
Edit: my version is not the latest, but ea-utils.1.1.2-484
Original comment by joachim....@gmail.com
on 11 Feb 2013 at 9:57
The program, by default, is looking for lines with adapter strings out of
300000 samples taken. contaA,C,D,E,G,I,K,L were all found (counted 640 out of
.... ). Others were not found in sufficient numbers. So they were not
clipped.
You cold run with -t .00001 ... forcing it to clip if it sees just one example.
Note: This program should only be used to remove "ligated-sequencing adapters"
- not other forms of sequencing contamination. In this case only 6-7 bases
would have to match for the various sequence to be clipped. In typical
Illumina runs, only 1-2 adapters are found in high enough numbers to be clipped.
Original comment by earone...@gmail.com
on 12 Feb 2013 at 6:31
Note: If you attach a file containing at least 300k reads to this ticket, I can
test it out, make sure it's behaving as designed. Also, I can probably get a
-t 0 to work (right now I'm not sure what -t 0 will do)
Original comment by earone...@gmail.com
on 12 Feb 2013 at 6:33
Thanks for the explanation. Okay - so basically I am using it for the wrong
purpose... I search for overrepresented strings in my RNA-seq data: in this
list nearly always the sequencing primer pops up, next to other strings.
During my current tests, I try to use Fastq-mcf to remove all overrepresented
strings, not only sequencing adaptor contamination.
Biological contamination, I remove with mapping to the expected sources (rRNA,
mtDNA, phiX genome). But the 'technical contamination', I currently try to
remove with fastq-mcf, for adapter removal, a set of polyA sequences (yes,
which I consider also technical), and other unidentified overrepresented
strings.
So apparently, fastq-mcf is not designed for what I am trying to do: but since
polyA/T removal would be a nice feature (and since you already have the
N-removal feature at the end of reads), perhaps it's worthwhile and rather easy
to implement?
BTW, I have made a galaxy wrapper around fastq-mcf. With your consent I can
upload it to the toolshed once testing has ended. See http://galaxyproject.org/
Thanks.
Original comment by joachim....@gmail.com
on 13 Feb 2013 at 7:59
True, polyA/T is easy enough. I guess a parameter like -P 6, woudl remove 6
or more A/T's from the ends of reads *after* adapter removal. In my
experience, however, polyA tails on RNA are desirable, anchoring assembly and
alignment, and you're better off designing a a polyA transcriptome to align to
... the way that rsem-prepare-reference does it.
http://www.biomedcentral.com/1471-2105/12/323
Original comment by earone...@gmail.com
on 13 Feb 2013 at 3:16
Uploading to galaxy is fine. If you need me to make changes, let me know.
Also the new version is streaming (it buffers the 300k sampling reads in RAM...
so you can use it as a part of a proper pipeline).
Original comment by earone...@gmail.com
on 13 Feb 2013 at 3:17
Thanks for the interesting RSEM article. It has been a long time already on my
reading list (...), but now I've read it, and indeed contains a lot of valuable
information. I will definitely compare RSEM to my current workflow.
Indeed, the polyA may be desirable in the case of transcriptome alignment, in
contrast to genome alignment. So it is up to you whether to include the polyA
trimming. I can place the functionality of fastq-mcf now more into context.
Again, thanks for the effort!
Original comment by joachim....@gmail.com
on 14 Feb 2013 at 9:33
Even with genome alignment, programs like tophat and mapsplice deal with poly-A
tails on their own, rather than relying. I've done some tests using SEQC data
and exome/rna seq comparisons and have found that a) RSEM and eXpress are the
most accurate quantifiers (roughly equivocal) b) aligning to the transcriptome
is an important first step for any tool (then you can align unaligned reads to
the genome). and c) bwa is pretty much always better than bowtie for accuracy
but you have to deal with a lot of its operational shortcomings
Original comment by earone...@gmail.com
on 4 Mar 2013 at 2:19
Original comment by earone...@gmail.com
on 4 Mar 2013 at 2:28
The poly-A feature has been added.
Original comment by earone...@gmail.com
on 24 Apr 2013 at 2:04
Original issue reported on code.google.com by
joachim....@gmail.com
on 11 Feb 2013 at 9:44