marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads

https://cutadapt.readthedocs.io

MIT License

output problem #114

Closed thierrygosselin closed 8 years ago

thierrygosselin commented 9 years ago

Hi Marcel,

I'm using cutadapt with the options -m 80 and --too-short-output, and I get this output:

=== Summary ===

Total reads processed: 3,204,994
Reads with adapters: 1,154,435 (36.0%)
Reads that were too long: 759,316 (23.7%)
Reads written (passing filters): 2,445,678 (76.3%)

Shouldn't it say "Reads that were too short" instead of "Reads that were too long"? I'm redirecting the reads shorter than 80 bp to a file.
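As a quick sanity check on the numbers in the summary, here is a minimal Python sketch (the variable and function names are illustrative, not part of cutadapt). It assumes -m 80 is the only length filter in play, in which case the reads flagged by the mislabelled line and the reads written should add up to the total, and the percentages should match the report.

```python
# Counts copied from the summary quoted above.
total = 3_204_994
flagged = 759_316      # reported as "too long" in the quoted output
written = 2_445_678    # "Reads written (passing filters)"


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# If the minimum-length filter is the only one active, every read is
# either flagged by it or written to the main output.
accounted = flagged + written
flagged_pct = percent(flagged, total)
written_pct = percent(written, total)

print(accounted == total, flagged_pct, written_pct)
```

The counts are consistent with a single filter having removed the 759,316 reads, which fits the reading that only the label is wrong.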

Thanks Thierry

marcelm commented 9 years ago

You are absolutely correct. I have already fixed this in the Git repository, but have not made a release yet. I’ll publish an update soon.

thierrygosselin commented 9 years ago

Strange: on one Mac I have v1.8, and on another one that I set up myself I have v1.9, which has the issue fixed, and I'm pretty sure I installed both from Git?

marcelm commented 9 years ago

I changed the version from 1.8 to 1.9.dev0 just a few commits back (perhaps one or two weeks ago). Probably the 1.8 from Git is slightly older.

thierrygosselin commented 9 years ago

Ok, but when I use git clone .... I still get v.1.8. Have no idea how I got the v.1.9!

One last thing: I'm using cutadapt to pre-process genotype-by-sequencing reads.

I used 2 enzymes (PstI and MspI) to reduce my genome.

When I use the option -a AGATCGGAAGAGCG, I get suggestions to add 3 bases (CCG), successively, until the 4 bases preceding the adapter are equal (btw, this is a very nice feature!). So basically, what this tells me is that the second cut site of my enzyme is present > 96% of the time.
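To illustrate the kind of evidence behind such a suggestion, here is a rough Python sketch that tallies the bases found immediately before the adapter. It is not cutadapt's implementation: it uses exact matching only, looks at three bases at once rather than one at a time, and the reads are made up for illustration. The function name `preceding_bases` is hypothetical.

```python
from collections import Counter

ADAPTER = "AGATCGGAAGAGCG"

# Made-up reads for illustration only; not data from this thread.
reads = [
    "TTGACCGAGATCGGAAGAGCGTT",
    "ACGTCCGAGATCGGAAGAGCG",
    "GGATCCGAGATCGGAAGAGCGA",
    "CATTAAGATCGGAAGAGCG",
]


def preceding_bases(reads, adapter, k=3):
    """Tally the k bases found immediately before an exact adapter match."""
    counts = Counter()
    for read in reads:
        pos = read.find(adapter)
        # Skip reads without the adapter or with fewer than k bases before it.
        if pos >= k:
            counts[read[pos - k:pos]] += 1
    return counts


counts = preceding_bases(reads, ADAPTER)
for kmer, n in counts.most_common():
    print(kmer, n)
```

If one k-mer dominates the tally, as CCG does in these toy reads, that hints that the adapter given with -a may be missing bases at its start.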

So strategically, to fully remove all the adapters and the cut site, would you suggest using -a AGATCGGAAGAGCG and -a CCGAGATCGGAAGAGCG with 10% errors allowed, or just one -a CCGAGATCGGAAGAGCG (cut site + adapter) with 20% errors allowed (to account for the times the CCG is not there or sequencing errors are present)?

Thanks Thierry

marcelm commented 9 years ago

Hi, great to hear the "your adapter may be incomplete" feature is being used :-).

I’m not familiar with genotyping by sequencing, so I’m not sure I’m qualified to answer. From what I see on the slide in this video, is it correct that the CCG you mention is part of an adapter with a sticky end that is complementary to the MspI restriction site? To me, it seems you would only be interested in those reads where you actually see the cut site and then using -a CCGAG... with the default of 10% errors would be the right thing. Why would you need to account for the cases where no CCG is there?
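For reference, the two error rates under discussion can be turned into error counts with a small Python sketch. It assumes that the number of allowed errors for a full-length match is the adapter length times the maximum error rate, rounded down, which is how I understand the cutadapt documentation; the helper `allowed_errors` is hypothetical and not part of cutadapt.

```python
import math


def allowed_errors(adapter, max_error_rate):
    """Errors tolerated for a full-length adapter match.

    Assumption: floor(adapter length * maximum error rate).
    """
    return math.floor(len(adapter) * max_error_rate)


short_adapter = "AGATCGGAAGAGCG"        # adapter only, 14 bases
long_adapter = "CCGAGATCGGAAGAGCG"      # cut site remnant + adapter, 17 bases

for adapter, rate in [(short_adapter, 0.1), (long_adapter, 0.1), (long_adapter, 0.2)]:
    print(adapter, rate, allowed_errors(adapter, rate))
```

Under this assumption, 10% allows a single error for either adapter, while 20% on the 17-base sequence allows three, which is as many as the CCG prefix has bases.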

Also, and I guess I’m just not understanding how the protocol works, but the adapter sequence you are using is (the beginning of the reverse complement of) the TruSeq Universal Adapter, whereas I’d expect you would need to remove the TruSeq Indexed Adapter, which would be -a AGATCGGAAGAGCA (same, but with an A at the end). See also the section in the documentation.
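The relationship between the two 14-base sequences mentioned above can be checked with a short Python sketch. The helper names are illustrative; only the two sequences come from this thread.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def mismatch_positions(a, b):
    """Return 0-based positions where two equal-length sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


used = "AGATCGGAAGAGCG"       # sequence passed to -a in the question
suggested = "AGATCGGAAGAGCA"  # sequence suggested for the indexed adapter

print(reverse_complement(used))             # CGCTCTTCCGATCT
print(mismatch_positions(used, suggested))  # [13]
```

The two sequences differ only in the last base, so the first 13 bases are shared between them.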

marcelm commented 8 years ago

No reply, so closing. Feel free to reopen if this is still relevant.