smangul1 / rop

The Read Origin Protocol (ROP) is a computational protocol that aims to discover the source of all reads, including those originating from repeat sequences, recombinant B and T cell receptors, and microbial communities.
https://github.com/smangul1/rop/wiki
GNU General Public License v3.0
35 stars 14 forks source link

CIRI died because different read lengths #100

Closed wdecoster closed 8 years ago

wdecoster commented 8 years ago

I really like the idea of your tool and I am eager to try it on my transcriptomics dataset. Nice work.

It took a while to trace the error, but I could find the log messages in the end telling me that the CIRI step died because the reads have different lengths. Is this something we have to take care of ourselves?

In addition, I first downloaded the script from sourceforge (according to the tutorial: wget https://sourceforge.net/projects/rop2/files/latest/download/rop.tar.gz) but this installation missed seqclean. Cloning git was alright.

Another remark is that the python code appears rather messy to me, could use some more functions instead of long code blocks. Smal functions doing limited things will keep code ordered, and make debugging and developing easy. Lastly, if you can avoid using os.sysem() that would be great, subprocess should cover all possibilities and simultaneously allow you to read error codes and act accordingly. In both problems (CIRI sizing and absence of seqclean) the os.system(cmd) just died and the script only discovered that after finding that the output was empty.

smangul1 commented 8 years ago

Hi Wouter,

Thanks for the valuable feedback. What is the read length you have? I will check if CIRI can work on the variable length.

Also thanks for your suggestions on the code. We are working on the code improvement. Some improvements should be available in the next release.

Thanks, Serghei

wdecoster commented 8 years ago

Hi Serghei,

I performed trimming on my data (containing poly A tails for example) so the lengths are anything between 25 and 140 nucleotides (with the majority above 120).

I had a look at the ciri page on sourceforge and in the latest added readme file it is stated: "What's new in CIRI2? ....

  1. CIRI2 supports variable read lengths and can process combined data sets after trimming. .... "

So probably CIRI2 might solve the problem?

Regards, Wouter

wdecoster commented 8 years ago

Perhaps unrelated, but I would recommend using the shlex module for formatting your command for using subprocess.call. You can feed shlex.split the string you would execute on the command line without bothering about how to separate arguments.

smangul1 commented 8 years ago

Thanks for all your helpfull suggestions. Kevin and Linus are working to integrate you suggestions and update to CIRI2 into a new release coming soon. We will keep you posted. Serghei

smangul1 commented 8 years ago

Hi Wouter,

We have pushed new release 1.0.5. We followed your suggestions to start improve the code. More improvements will come in the next release. CIRI2 should be also available in the next release. For now i suggest 2 options: 1) don't trim so the length will b ethe same 2) run ROP with jobarray option, which will basically will run Step 1 and Step 2 and save after_step1_step2.fastq file which you can run manually through CIRI2

wdecoster commented 8 years ago

Dear Serghei,

Thanks for letting me know. I'll give it a try. I trimmed to remove polyA tails and polyG tails (because NextSeq sequencing was used). I think the second suggestion would be the most relevant.

Cheers, Wouter