institut-de-genomique / NaS

NaS is a hybrid approach developed to take advantage of data generated using the MinION device. We combine Illumina and Oxford Nanopore technologies to produce NaS (Nanopore Synthetic-long) reads of up to 60 kb that aligned with no error to the reference genome and spanned repetitive regions.
http://www.genoscope.cns.fr/nas/

Does NaS require all reads to be of the same size? #9

Closed zmatosevic closed 7 years ago

zmatosevic commented 7 years ago

For my Illumina input, I used trimmed reads, which are of various sizes. The original reads are all 251 nucleotides long.

I will list the errors I got here, in the hope that you can see where the problem lies.

This was in my error file:

cat: /....../Results//assemblies/*/NaS_hqctg_reads_final.fa: No such file or directory
awk: cmd. line:1: (FILENAME=- FNR=1) fatal: division by zero attempted
awk: cmd. line:1: fatal: division by zero attempted
gawk: cmd. line:1: fatal: division by zero attempted

This was in my output file:

Command : /..../NaS/NaS_v2/NaS_wrapped --fq1 /.../NaS_workflow/Esu2015_Nanopore_L_XL/Esu2015_paired_1_edited.fq --fq2 /..../NaS_workflow/Esu2015_Nanopore_L_XL/Esu2015_paired_2_edited.fq --nano /..../NaS_workflow/Esu2015_Nanopore_L_XL/Nanopore_L_XL_edited.fasta --out /..../NaS_workflow/Esu2015_Nanopore_L_XL/Results/ --mode sensitive --nb_proc 34 --rmtmp no
checking for "parallel" ... ok
checking for "lastal" ... ok
checking for "blat" ... ok
checking for "runAssembly" ... ok
checking for "exonerate" ... ok
[Mon Jul 10 14:40:54 CEST 2017] Create output directory : /..../NaS_workflow/Esu2015_Nanopore_L_XL/Results/
[Mon Jul 10 14:40:54 CEST 2017] Create fasta file from fastq...
[Mon Jul 10 14:43:19 CEST 2017] Alignement step in sensitive mode...
[Mon Jul 10 14:52:19 CEST 2017] Convert maf file to psl file...
[Mon Jul 10 14:52:19 CEST 2017] Select reads...
[Mon Jul 10 14:52:19 CEST 2017] Retrieve similar reads...
[Mon Jul 10 14:52:19 CEST 2017] Generate NaS reads...
[Mon Jul 10 14:52:20 CEST 2017] Untangle complex NaS reads...
[Mon Jul 10 14:52:20 CEST 2017] Generate statistics...
NbReads= 92624 CumulativeSize= 364755838 N50size= 6394 minSize= 500 maxSize= 1278193 avgSize= 3938.03 => /......./NaS_workflow/Esu2015_Nanopore_L_XL/Results//NANO_reads.stats
NbReads= CumulativeSize= N50size= minSize= maxSize= avgSize= => /....../NaS_workflow/Esu2015_Nanopore_L_XL/Results//NaS_hqctg_reads.stats
[Mon Jul 10 14:52:25 CEST 2017] Temporary files were not deleted because errors occured...
[Mon Jul 10 14:52:25 CEST 2017] Total execution time with 34 core(s) : [00:11:31]

The assemblies, reads and reads2 folders are empty.

This is what the NANO_reads.stats file looks like:

-------------------- GLOBAL STATISTICS -------------------

N50 size= 6394 number= 14243
N80 size= 2851 number= 39400
N90 size= 1776 number= 55575
Assembly size= 364755838 number= 92624
minSize= 500 maxSize= 1278193 averageSize= 3938.03

----------------------------------------------------------

-------------------- SIZE REPARTITION --------------------

Size= >= 1000000 Number= 2 (0.00) CumulativeSize= 2339276 (0.64)
Size= >= 100000 Number= 132 (0.14) CumulativeSize= 31848093 (8.73)
Size= >= 50000 Number= 303 (0.33) CumulativeSize= 43551530 (11.94)
Size= >= 10000 Number= 4359 (4.71) CumulativeSize= 104816580 (28.74)
Size= >= 5000 Number= 21609 (23.33) CumulativeSize= 224059482 (61.43)
Size= >= 1500 Number= 60424 (65.24) CumulativeSize= 336212372 (92.17)
Size= >= 1000 Number= 71555 (77.25) CumulativeSize= 349926828 (95.93)
Size= >= 500 Number= 92624 (100.00) CumulativeSize= 364755838 (100.00)

----------------------------------------------------------

-------------------- BASE COMPOSITION --------------------

NumberOfN= (0%)
NumberOfGC= 173635233 (47.6%)

----------------------------------------------------------

The NaS_hqctg_reads.fa file is empty and the NaS_hqctg_reads.stats file looks like this:

-------------------- GLOBAL STATISTICS -------------------

----------------------------------------------------------

-------------------- SIZE REPARTITION --------------------

----------------------------------------------------------

-------------------- BASE COMPOSITION --------------------

----------------------------------------------------------

The selectReads.stderr and selectReads.stdout files are also empty. maf-converter.stderr and maf-converter.stdout are also empty.

The last-alignment.stderr file consists of these 2 lines repeated many times:
lastal: bad symbol in sequence: 1
lastal: bad symbol in sequence: 2

I think this last error might be a hint as to what is going on.

The psl folder contains only the file last-alignment.A1.B1.E40.job1.psl, but this file is empty.

I hope you can help me to get NaS up and running.
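
The `lastal: bad symbol in sequence` messages reported above suggest that lastal met characters (here `1` and `2`) that it does not accept as sequence symbols. One way to locate such characters is to scan a FASTA file for sequence lines containing anything outside the IUPAC nucleotide alphabet. The following is only a minimal sketch and is not part of NaS or LAST; the function name `find_bad_symbols` and the file name in the usage comment are hypothetical.

```shell
# Hypothetical helper (not part of NaS or LAST): list FASTA sequence lines
# that contain characters outside the IUPAC nucleotide alphabet.
# Usage: find_bad_symbols reads.fa
find_bad_symbols() {
    awk '
        /^>/ { name = $0; next }                 # remember the current header
        /[^ACGTUNRYKMSWBDHVacgtunrykmswbdhv]/ {  # any non-nucleotide character
            printf "%s\tline %d\t%s\n", name, NR, $0
        }
    ' "$1"
}
```

Each reported line shows the header of the affected record, the line number in the file, and the offending sequence line. Which FASTA file to scan depends on the run; the thread does not say which input lastal was reading when it printed the messages.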

fxbabin commented 7 years ago

Hello zmatosevic,

I think your problem could be linked to badly formatted sequence names. Your sequence names must end with "/1" or "/2". You can use the following command to add /1 (for your forward fastq file):
cat forward.fastq | awk '{if(NR%4==1){print $0"/1"}else{print $0}}' > good_format_forward.fastq
and the following one for the reverse:
cat reverse.fastq | awk '{if(NR%4==1){print $0"/2"}else{print $0}}' > good_format_reverse.fastq

Hope this helps.

Tell us if you encounter any other problem.

zmatosevic commented 7 years ago

I have adapted my sequence names, in the Illumina as well as the Nanopore reads. Do you have an idea what else might be causing it?

Just to be sure, I will copy some reads here:

@MISEQ:282:000000000-AACHF:1:1101:16336:1936 1:N:0:/1
CGAACGGGGCGGATGTCCGGTCGTCCGCGAAGAACAGGCCGTGCCCGCACAGCCGGTGGCTGAAGCGCTGGCACCCATCCGGCACGCGCTGGATGCGCACACGGTTCGCGCCATCGGTGTGGTGCACGAGTTCCTCGGCGTCGGATCGGCGGGCGGCGCGCCGGGGCAGCGTCGGCAGCCGCCCCGGGGAAAAAGGTTTGCTGGGGTTTATACCCAGCTGGGCGGGAGTGTCCCAGCCGCGCGGCCCCCCG
+
BBBBABBBBBDBGGGFGGGGGGGGFGGCFEE?EHHGHEEECEEFFHGGGGGHGHGEGGGHHGHHHHGGEGGGHGGDGHHHGGGGGGGGGGGGHHHGGGGGGHGGEEEGGGGBEGGGGGGFFDFFFFFFFF.EFFFFFAEBA=;-BDFFFFFF-------9-----9--9.9..;--9-..---;---;9-.....;/B/./9//.9.-.//;;//;../:;9.-----;;///9//.9---9----9;---
@MISEQ:282:000000000-AACHF:1:1101:13059:1937 1:N:0:/1
CACACACACACACACACACACACACACAAACACACATACACAAACACATATACGCATACTCACTCCCCGAGTATCACTATATTCCACAAAATCCATCCACTACCCTCTGTTTCACATAGAAAATATTAGTCATACAATCAAAGCGTGTGACTCTGTATGCAAATATCTTCTATAAATTTGATATTATGATCGACACTACTCCTATAACTTGATATTTACCCACATGGTCTAATAATCCCTCCTAGCATATT
+
BBAABBABBBBBFEEEGGGGGCEEAGGG222A22BE33B3FF111111335551111BG535@552110//114BFGBB4444444B311133B3BD0334332?3?23?4444B4B3BF3B2323B333333332B3222222////21<111@1111111111@F<1<1=1<1=1=111=11>1100..-.<0//00/00:;00:;0000;0;0;0/09900/;;90009009000/./00/00000

And nanopore header:

aaced372-4e44-4f8f-8665-4521ceaf3305_Basecall_2D_template

fxbabin commented 7 years ago

I am sorry, I cannot check many things for now (I am out of the lab for 2 weeks). I will look into your problem when I come back. In the meantime, I would suggest trying NaS in fast mode (using blat), since the problem seems to come from the alignment step using last.

zmatosevic commented 7 years ago

I checked again and it seems that you were right about the read name formatting: the /1 and /2 suffixes were in place, but there was a space character in the read names, and I think this was causing the problems.

Thank you for the help and fast responses.
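
For readers hitting the same problem, the two requirements that emerge from this thread (read names ending in "/1" or "/2", and no whitespace inside the name) can be checked and enforced together. The sketch below is one possible approach and not part of NaS: the function names and the file names in the usage comments are hypothetical, it assumes four-line FASTQ records, and it assumes that discarding the description after the first space (for example `1:N:0:`) is acceptable.

```shell
# Hypothetical helpers (not part of NaS). Both assume 4-line FASTQ records.

# Count header lines that contain whitespace or do not end with /1 or /2.
# Usage: count_bad_names < forward.fastq
count_bad_names() {
    awk 'NR % 4 == 1 && ($0 ~ /[ \t]/ || $0 !~ /\/[12]$/) { n++ }
         END { print n + 0 }'
}

# Rewrite headers so each name has no whitespace and ends with /<mate>.
# Usage: fix_fastq_names 1 < forward.fastq > good_format_forward.fastq
fix_fastq_names() {
    awk -v mate="$1" '
        NR % 4 == 1 {
            sub(/[ \t].*$/, "")   # drop everything from the first space or tab
            sub(/\/[12]$/, "")    # drop an existing /1 or /2 suffix, if any
            print $0 "/" mate
            next
        }
        { print }                 # sequence, "+", and quality lines are unchanged
    '
}
```

Running `count_bad_names` before and after the fix gives a quick confirmation that no header still carries a space or lacks its mate suffix. Pass `2` instead of `1` to `fix_fastq_names` for the reverse file.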