sing-group / bicycle

bicycle (bisulfite-based methylcytosine caller) is a next-generation sequencing bioinformatics pipeline able to perform a full DNA methylation level analysis
http://www.sing-group.org/bicycle
GNU Lesser General Public License v3.0
8 stars 1 forks source link

align with mutli thread issue #3

Closed freedomq8 closed 4 years ago

freedomq8 commented 4 years ago

Hi there,

This is my first time using bicycle and i noticed with i align with more than three threads -t 4 i get a java error

Exception in thread "Thread-18" [INFO] BowtieAlignment: Start read feeding to alignment against workingDirectory/ucsc.hg19.fa_bisulfited_GA. Log file: output/bisulfited_CT_8209_against_ucsc.hg19.fa_CRICK.sam_p_1.log java.lang.RuntimeException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1 at es.cnio.bioinfo.bicycle.operations.BowtieAlignment$1AlignerThread$1.run(BowtieAlignment.java:786) Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1 at java.lang.String.substring(String.java:1931) Nevertheless, the mapping keep going but it get stuck in the end without moving to the next sample.

any suggestions of what might be going wrong ?

Thanks

lipido commented 4 years ago

Hi @freedomq8!

First of all. Thank you for considering bicycle for your analysis.

We have to see if this is related to multithreading so, it is non-deterministic. So, could yo try to launch it a few times? Did you get always the same exception?

In any case, it would be very useful if you could send me some of your input data in order to reproduce the problem by myself.

freedomq8 commented 4 years ago

Thanks for the fast reply,

I did run the align few times and it always stuck at the end of the analysis for the first sample (however the therads used were in -t 21 , t20). After that i used -t 4 and saw the error above and i assumed it has something to do with it so i ran it with -t 3 with no error and that does the trick. Also when compared the number of the files with the successful alignment with -t 3. The end results for each sample is two files compared to unsuccessful run which were around 20 with sam files one of them titled Watson ( i forgot the other as i delete those in order to work on the same directory). if you have any suggestions i would be more than happy to try them out.

Regarding the data its patients data so i prefer not.

N.B although i used -t 3 threads i only see one thread is working ? also i have 24 threads available. Also want to point out that i used zipped fastq files intially and there was no warning when i run any command but nothing was happening so maybe add an error for the user to adjust for that.

Thanks

lipido commented 4 years ago

Please, try with -t 8 a few times and give me the output in the console, the command you used, the bowtie version, the type of samples (directional/non-directional paired/un-paired). Also try with -t 1.

Which version of bicycle are you using? Do you use the docker version?

freedomq8 commented 4 years ago

paired ends didnt get the directional bit. this is the first time i analyse methyseq. does the lab tech know this or is it kit specific ours is illumina truseq-methyl-capture. I just downloaded docker and will give it a go and see if i get the same results. thanks for our suggestions.

Command used

/home/daruma/soft/bicycle-1.8.1/cmd/bicycle align -p /media/daruma/2D2F5F2D0A4312BB/methy -t 21 --bowtie2-quals phred33 --bowtie2-I 0 --bowtie2-X 500

lipido commented 4 years ago

Ok, I think it could be directional (default option in bicycle). So the protocol (directional, paired-end) is the same as for our case study in bicycle web page: https://sing-group.org/bicycle/casestudy.html

Could you give it a try, but by changing the -t option to in the alignment to -t 21, and see if it finishes correctly?

freedomq8 commented 4 years ago

okay, i had a go with -t 8 first and it suck, however, i looked at the log files in the output directory and found the last two logs files for WATSON and CRICK with this error WATSON.sam_p_3.log and CRICK.sam_p_3.log with the same message Time loading reference: 00:00:00 Time loading forward index: 00:00:00 Time loading mirror index: 00:00:00 Error: Read SN1044:330:ccdgyacxx:1:1114:19705:46286_1:N:0:CGACGT||TTTCGTTTTTTATATGTAGTTGTTTTTTAGGGGAAGTGAGAATTGGTTTAAGTTATAGGTGATTATGATAGGATTTT||@SN1044:330:ccdgyacxx:1:1114:18581:46371 2:N:0:CGATGT has more read characters than quality values. terminate called after throwing an instance of 'int' Aborted (core dumped) (ERR): bowtie2-align exited with value 134

any suggestions ? could it be Illumina specific FASTQ encoding ? could it be adapters ? and why it was successful with less -t 3 but not more ?

lipido commented 4 years ago

Okay, good point. Is this a file of your dataset? It seems that it is a bug with bicycle when reading your fastq files in multithreading mode. Bicycle splits your fastq files on the fly before passing the reads trough bowtie2 in multithreading mode. I think that this splitting may contain a bug.

In order to reproduce the bug in my environment, it would be very useful if you could give me the fastq files of the failing sample. I don't need the metadata of the sample, nor its name or condition. If you need it, you could replace all sequences by a fixed random sequence before sending it to me? The important is that the files have the same read headings and qualities, as well as the exact number of lines in the fastq files.

To do so, first hide your sequences by a fixed one by doing this with all your fastq files:

cat my-sample-pair-1.fastq | awk '{if (NR%4 -2 ==0) {print "TTTCGTTTTTTATATGTAGTTGTTTTTTAGGGGAAGTGAGAATTGGTTTAAGTTATAGGTGATTATGATAGGATTTT"} else {print}}' > my-sample-fixedseq-pair-1.fastq cat my-sample-pair-2.fastq | awk '{if (NR%4 -2 ==0) {print "TTTCGTTTTTTATATGTAGTTGTTTTTTAGGGGAAGTGAGAATTGGTTTAAGTTATAGGTGATTATGATAGGATTTT"} else {print}}' > my-sample-fixedseq-pair-2.fastq

Next, try again with -t 3, and if it gets stucked, send these fastq files to me.

Thank you!

freedomq8 commented 4 years ago

Thanks for the speedy replies as well as your support

I tried different bowtie2 versions as well as docker image all had the same problem with more than 3 threads. Nonetheless, i manage to run the demo samples e.coli just fine and tried the human samples but didnt finish it (it seems fine though). I tried to use the a difference reference supplied with human case study, but same results. So its could be what you mentioned above. give it a go and let me know. it could be an adapter that needs cleaning ( if you know a good tool for that let me know) https://www.dropbox.com/s/g3dae24vzj1n9oh/my-sample-fixedseq.zip?dl=0

lipido commented 4 years ago

Thank you very much. I will try to reproduce your bug now (downloading, I have a slow connection in my home). But some additional questions:

freedomq8 commented 4 years ago

thanks

lipido commented 4 years ago

I am running bicycle in order to reproduce the bug. By the way, please, Could you also provide the new-project step parameters? How did you specify the the --paired-mate1-regexp/-m parameter?

freedomq8 commented 4 years ago

Below are the command i used.

/home/daruma/soft/bicycle-1.8.1/cmd/bicycle create-project -p /media/daruma/7B41D21E28D6D2DB/methyl/D1 -r /media/daruma/7B41D21E28D6D2DB/methyl/referenceGenome/ -f /media/daruma/7B41D21E28D6D2DB/methyl/R1/ -b2 /home/daruma/soft/bowtie2-2.4.1-linux-x86_64/ -s /home/daruma/soft/samtools-1.9/ --paired-mate1-regexp _1.fastq

Now i tried to analyse each sample with two threads running them as different project and hopefully create a new project and put them all in the output file as work around ? what do you think of this approach?

Also another question regarding DMR analysis is there a way to get the annotation for hg19. I do have the Illumina manifest files

one bed file with this structure: chr1 17363 17575 chr1_17364-17575 chr1 91548 91553 chr1_91549-91553

And another txt file with this structure: Chr Start Stop UCSC_CpG_Islands_Name Relation_to_UCSC_CpG_Island chr1 17363 17575 chr1 91548 91553 chr1 91570 91587 chr1 129279 129380 chr1 135141 135255 chr1:135124-135563 Island

based on the case study demo analysis, i see some gene names. could you direct me how to get annotation file (hg19) to include for the analysis ?

lipido commented 4 years ago

Many thanks for the command,

I think your approach is feasible as workaround, because all files contain the sample name, so it is possible to merge the workingDirectory and outputs.

Regarding DMR, could you open another issue with the question, I will try to answer there.

lipido commented 4 years ago

I can confirm the bug. I am working in the fix...

lipido commented 4 years ago

I have tried to fix the bug. Could you please try to rerun the alignment with this version and with -t 3?

Download fixed version: https://www.sing-group.org/owncloud/index.php/s/bgKIriCdO90Udfv

When you run this version please ensure that you see the 1.8.2-SNAPSHOT version in the bicycle welcome messages.

freedomq8 commented 4 years ago

its working well now.

lipido commented 4 years ago

Fixed in version 1.8.2