rrwick / Porechop

adapter trimmer for Oxford Nanopore reads
GNU General Public License v3.0

editing adapters.py #4

Closed miles-gene closed 7 years ago

miles-gene commented 7 years ago

Hello, I would like to use porechop to remove BAC vector sequence from my reads, or at least split the reads at the vector sequence, which in most cases is somewhere in the middle of the read. I wondered whether I could achieve this by editing the adapters.py file. Porechop ran fine un-edited and found and trimmed many instances of the NSK007 Y adapter from the start of my reads. I replaced the NB12 adapter sequence with my own in adapters.py and ran porechop on the same fastq reads (many of which contain at least an 85% match to the 38 bp I specified). The output was the same. In fact, it doesn't seem to matter what changes I make to adapters.py: the outcome is always the same. For example, editing an adapter name is not reflected in the verbose output. I have never written a python script, so I imagine I'm missing something basic.

thanks for reading Miles

rrwick commented 7 years ago

Hi Miles,

What you described should work. I suspect the problem may be that you're editing the Porechop source code, but when you run the program you're running a copy that was installed elsewhere on your computer. Did you install it with the python3 setup.py install command? The installed copy may be lacking your change.

You could either:

  1. Re-run the python3 setup.py install command after making your change.
  2. Run Porechop from the source directory rather than from the installed copy: instead of calling porechop, call path/to/Porechop/porechop-runner.py
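
If it is unclear which copy is being used, one quick check is to ask Python where it would load the package from. This is only a sketch, and it assumes the installed package imports under the name `porechop`; the helper name `module_location` is made up for illustration.

```python
import importlib.util


def module_location(name):
    """Return the file Python would load for a top-level module, or None if it is not importable."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    # Assumption: the package name is "porechop". If the printed path is not
    # inside the source directory being edited, the edits are not being run.
    print(module_location("porechop"))
```

If the printed path points into a site-packages directory rather than the cloned source tree, edits to the source copy of adapters.py will have no effect until the package is reinstalled.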

One other thing to mention: Porechop runs on the assumption that it will find adapter sequences at the start/end of reads. So its first step is to look at the start/end of the first handful of reads to find what adapters are there, then it looks more thoroughly in all reads for the adapters it found. So it may miss your BAC vector sequence if it is not at the start/end of your first reads.

The defaults are --end_size 100 --check_reads 1000, meaning it only scans for adapters in the first/last 100 bp of the first 1000 reads. If that misses the BAC vector, you could try larger values like --end_size 1000 --check_reads 10000 to make sure they are found (though it will take a bit longer).
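
To see why the defaults can miss a sequence that sits mid-read, here is a deliberately simplified sketch of the idea. It uses exact substring matching on plain strings, which is an assumption made for brevity and is not how Porechop itself scores adapters; the function name and the toy read are hypothetical.

```python
def adapter_in_read_ends(reads, adapter, end_size=100, check_reads=1000):
    """Return True if `adapter` occurs exactly in the first or last `end_size`
    bases of any of the first `check_reads` reads.

    Simplification: exact matching only, purely to illustrate the two options.
    """
    for seq in reads[:check_reads]:
        start = seq[:end_size]
        end = seq[max(0, len(seq) - end_size):]
        if adapter in start or adapter in end:
            return True
    return False


# Toy read with a made-up "vector" fragment in the middle of ~1 kb of sequence.
vector = "GATTACAGATTACA"
reads = ["A" * 500 + vector + "A" * 500]

found_default = adapter_in_read_ends(reads, vector)               # ends only
found_wide = adapter_in_read_ends(reads, vector, end_size=1000)   # wider window
```

With the default window the fragment at position 500 is never inspected, whereas the wider window covers it, which is the effect of raising --end_size.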

Let me know how you go! I'll close this issue now, but if it still doesn't work for you, I can reopen it.

Ryan

rrwick commented 7 years ago

This did make me think: custom adapters would be a nice feature. E.g. so you could supply a FASTA file instead of hacking away at the source code. I'll make a new 'issue' with this feature request.

miles-gene commented 7 years ago

Hi Ryan

Thanks for your reply. You were right, I didn't realise the difference between source code and installed code, d'oh. I went ahead with my hacking and tried adding the entire 7506 bp vector sequence to adapters.py. I got "EOL while scanning string literal" error messages, so I gave up (the problem was the last ] I think) and added 40 bp from the start and end of the vector sequence as two additional adapters. The input reads had been pre-processed by porechop to remove NSK007 adapters. After some experimentation I decided I needed to run porechop twice, with the output from the first round feeding into the second. The second round had --end_size=8000. Round one split 1038/3261 reads and round two split 283/4295 and trimmed 7.1 Mbp of sequence. Graphmap suggests that the resulting 4533 reads are vector free. Now I assemble. Thanks for your help, and I think the custom adapters feature you mention would be a great help. I also wondered if appending the vector sequence to the start of the reads file would remove any uncertainty about whether porechop could find the adapters it looks for during its first pass, with no need to change --end_size. Unfortunately I forgot to do this, and I guess I got lucky and one of my first 1000 reads had my adapter sequence in the first 100 bases. Miles
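
For anyone hitting the same error: "EOL while scanning string literal" is what Python 3 versions before 3.10 report when a quoted string reaches the end of a line without its closing quote (newer versions say "unterminated string literal"). A common cause is pasting a long sequence that wraps over several lines. The sketch below shows one way to write a long sequence across lines, plus the 40 bp start/end approach described above. The sequence is a made-up placeholder, not a real vector, and nothing here reflects the actual layout of adapters.py.

```python
# Placeholder sequence standing in for a long vector pasted over several lines.
# Adjacent string literals inside parentheses are joined by Python, so each
# line carries its own opening and closing quote.
VECTOR = (
    "GATCCTCTAGAGTCGACCTGCAGGCATGCAAGCTTGGCGTAATCATGGTC"
    "ATAGCTGTTTCCTGTGTGAAATTGTTATCCGCTCACAATTCCACACAACA"
)


def vector_ends(seq, n=40):
    """Return the first and last `n` bases of `seq` as a (start, end) tuple."""
    if len(seq) < n:
        raise ValueError("sequence is shorter than the requested end length")
    return seq[:n], seq[-n:]


start_40, end_40 = vector_ends(VECTOR)
```

The two short fragments can then be used wherever a full-length sequence would be unwieldy.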