simoncchu / REPdenovo

A tool to construct repeats directly from raw reads
MIT License
16 stars 3 forks

error in assembly #5

Closed Mobean closed 5 years ago

Mobean commented 7 years ago

I tried running the assembly step and got an error that Jellyfish doesn't recognize one of its options. This isn't an option I set myself, so I'm not sure how to address it. Any ideas?

/REPdenovo$ python ./main.py -c Assembly -g configFileFUK.txt -r Fuk_raw_reads.txt
Calculating average coverage....
Running command: echo $(wc -l /TowerData/home/colleen/Fukomys_reads/SRR960036_1.fastq)...
The number of reads in file /TowerData/home/colleen/Fukomys_reads/SRR960036_1.fastq is ('631034320 /TowerData/home/colleen/Fukomys_reads/SRR960036_1.fastq\n', None)
Running command: echo $(wc -l /TowerData/home/colleen/Fukomys_reads/SRR960036_2.fastq)...
The number of reads in file /TowerData/home/colleen/Fukomys_reads/SRR960036_2.fastq is ('631034320 /TowerData/home/colleen/Fukomys_reads/SRR960036_2.fastq\n', None)
Read coverage is: 18.0763463998
rm: cannot remove ‘./FukomysTE/Asm’: No such file or directory
rm -r ./FukomysTE/Asm_
rm: cannot remove ‘./FukomysTE/.fa’: No such file or directory
rm ./FukomysTE/.fa
rm: cannot remove ‘./FukomysTE/.fastq’: No such file or directory
rm ./FukomysTE/.fastq
Working on 30mer now...
Counting kmers...
Running command: /usr/local/bin/jellyfish count -s 500M --bf-size 30M -C -m 30 -t 15 -o ./FukomysTE/mer_counts.jf -F 2 /TowerData/home/colleen/Fukomys_reads/SRR960036_1.fastq /TowerData/home/colleen/Fukomys_reads/SRR960036_2.fastq...
count: unrecognized option '--bf-size'
Use --usage or --help for some help
Running command: /usr/local/bin/jellyfish dump -L 180 -o ./FukomysTE/dumped_30mers.txt ./FukomysTE/mer_counts.jf...
terminate called after throwing an instance of 'jellyfish::compacted_hash::ErrorReading'
  what(): './FukomysTE/mer_counts.jf': File truncated
Aborted (core dumped)
Highest 30mer frequency is 0
Traceback (most recent call last):
  File "./main.py", line 408, in <module>
    main_func(scommand,sfconfig,sfreads_list)
  File "./main.py", line 369, in main_func
    bpaired, sfleft_reads, sfright_reads, sfsingle_reads)
  File "/TowerData/home/colleen/REPdenovo/Assembly.py", line 146, in assembly
    with open(sconcate,"r") as fin:
IOError: [Errno 2] No such file or directory: './FukomysTE/30mer.temp_contigs.fa'

simoncchu commented 7 years ago

@Mobean

I think the Jellyfish you are using is an old version. Please update to the latest version and try again: https://github.com/gmarcais/Jellyfish.

Also, please make sure your samtools version is 1.3.1 or later. Some other users have run into problems when using an old version of samtools.

Another note, as mentioned in the wiki for Velvet: "Caution: if you want to assemble k-mers that are longer than 30 bp, you need to recompile Velvet to let it work with longer sequence length. For example: make 'MAXKMERLENGTH=60'. This makes Velvet work for k-mer length up to 60."

Mobean commented 7 years ago

That was certainly the case. I made sure all of the dependencies were up to date (Jellyfish and samtools) and re-installed REPdenovo. It was working, but stalled once it got to the Jellyfish k-mer=40 step.

Working on 40mer now... Counting kmers... Running command: /usr/local/bin/jellyfish count -s 500M --bf-size 30M -C -m 40 -t 15 -o ./FukomysTE/mer_counts.jf -F 2 /TowerData/home/colleen/Fukomys_reads/SRR960036_1.fastq /TowerData/home/colleen/Fukomys_reads/SRR960036_2.fastq...

It was stuck on this for over 24 hours. I assumed this was because Velvet hadn't assembled anything, so I quit and recompiled Velvet with make 'MAXKMERLENGTH=60'. Now when I re-run, it assembles the 30-mers fine, but crashes at the 40-mer assembly.

Working on 40mer now...
Highest 40mer frequency is 0
Traceback (most recent call last):
  File "./main.py", line 408, in <module>
    main_func(scommand,sfconfig,sfreads_list)
  File "./main.py", line 369, in main_func
    bpaired, sfleft_reads, sfright_reads, sfsingle_reads)
  File "/TowerData/home/colleen/REPdenovo/Assembly.py", line 146, in assembly
    with open(sconcate,"r") as fin:
IOError: [Errno 2] No such file or directory: './FukomysTE/40mer.temp_contigs.fa'

Not sure where to go from here. Any suggestions?

simoncchu commented 7 years ago

@Mobean It seems it's not Velvet but Jellyfish that isn't working.

Can you first try running these commands (create a temp folder, and run Jellyfish there to do the 40-mer counting):

mkdir temp
cd temp
/usr/local/bin/jellyfish count -s 500M --bf-size 30M -C -m 40 -t 15 -o ./FukomysTE/mer_counts.jf -F 2 /TowerData/home/colleen/Fukomys_reads/SRR960036_1.fastq /TowerData/home/colleen/Fukomys_reads/SRR960036_2.fastq

If this works well, then clean the old intermediate data by running the "Clean" option (just change the "Assembly" option to "Clean"): "python ./main.py -c Clean -g configuration-file-name -r raw-reads-file-name", and then run the Assembly again.

Please let me know whether either of these works.

Mobean commented 7 years ago

Jellyfish is running right now. Any guess at how long it should take to run, at least relative to the 30-mers? As I mentioned, it seemed to be stuck at this point when I terminated it before (it did nothing for >24 hrs). My guess was that it wouldn't take as long as the 30-mers, since it would have fewer hits, but am I off base here?

When I run top, jellyfish is running but no output file has been created.

simoncchu commented 7 years ago

@Mobean How large are the read files? Jellyfish usually takes a long time if the coverage is high and the genome is large, but it should be quicker than for the 30-mers. To speed things up you can use lower-coverage data; it will not affect the results much, since we use a coverage-relative threshold.
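As a side note on coverage: the "Read coverage is: 18.0763463998" line in the log earlier in this thread follows directly from the `wc -l` counts and the config values. A minimal sketch of that arithmetic (the numbers are copied from the log; the helper function is illustrative, not REPdenovo code):

```python
# Estimate read coverage from `wc -l` line counts of a FASTQ pair.
# Each FASTQ record is 4 lines, so
# coverage = (total reads * read length) / genome length.
def fastq_coverage(line_counts, read_length, genome_length):
    total_reads = sum(n // 4 for n in line_counts)
    return total_reads * read_length / float(genome_length)

cov = fastq_coverage(
    line_counts=[631034320, 631034320],  # wc -l of SRR960036_1/2.fastq
    read_length=150,                     # READ_LENGTH from the config
    genome_length=2618204639,            # GENOME_LENGTH from the config
)
print(round(cov, 4))  # 18.0763, matching the log line
```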

You can let it run in the background with a command like: "nohup your-command &"

An optimal solution is to use KMC2 (https://github.com/refresh-bio/KMC) for the k-mer counting. It has two steps: "count" and "dump". Then convert the output to the same format as the 30-mers.
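A sketch of the conversion step, under stated assumptions: KMC's text dump writes `kmer<TAB>count` lines, while `jellyfish dump` (as invoked here, without `-c`) writes FASTA-style records with the count as the header. The helper below is hypothetical, not part of either tool; check your versions' actual output formats.

```python
# Convert KMC-style dump lines ("kmer<TAB>count") into the FASTA-style
# format produced by `jellyfish dump` (">count" header, then the k-mer),
# which is the format the 30-mer step already consumes. The min_count
# filter mimics the `jellyfish dump -L 180` cutoff seen in the log.
def kmc_to_jellyfish_dump(kmc_lines, min_count=180):
    records = []
    for line in kmc_lines:
        kmer, count = line.split()
        if int(count) >= min_count:
            records.append(">{0}\n{1}".format(count, kmer))
    return "\n".join(records)

sample = ["ACGTACGTACGTACGTACGTACGTACGTAC\t200",
          "TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT\t5"]
print(kmc_to_jellyfish_dump(sample))  # only the count-200 k-mer survives
```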

Mobean commented 7 years ago

The read file is around 45 Gbases / 31 GBytes. This is my first time trying your script, so I only included one paired-end sequencing run to see what I got and potentially work out bugs.

Glad to know I wasn't off thinking that it should go faster than the 30-mers. When I originally started the python script, it progressed steadily through the 30-mers. It only stalled once it started working on the 40s. I have been monitoring the activity with top, and the %CPU has dropped substantially, from 98% to 5%. It is running on a remote server; I have been logging in with screen to check progress. Any ideas?

simoncchu commented 7 years ago

@Mobean How long did the counting of the 30-mers take? The read size is fine if uncompressed, but Jellyfish does take time and resources. That is why on each re-run the program keeps the k-mer files unchanged, unless you run the "Clean" option.

I will add an option for using KMC to the tool later. You can try running it yourself; it is faster and more memory-efficient. But it had not been released at the time I developed the tool, so it is not integrated.

Mobean commented 7 years ago

Just to give you an update: I cleared everything as you suggested and started the analysis again. It took about an hour for the 30-mer counting and assembly (Jellyfish and Velvet). It is now starting the 40-mers... I will let you know if it gets through this time.

Thanks for all your help!

Mobean commented 7 years ago

Update: I left it running over the long weekend (5 days) and, rather than stalling as I originally thought, it seems to be stuck in a loop. It says it is working on the 40-mers but then works on the same 30-mer data. I have pasted a portion of the repeated output below. Any suggestions?

Working on 40mer now...
Counting kmers...
Running command: /usr/local/bin/jellyfish count -s 500M --bf-size 30M -C -m 40 -t 15 -o ./FukomysTE/mer_counts.jf -F 2 /TowerData/home/colleen/Fukomys_reads/SRR960036_1.fastq /TowerData/home/colleen/Fukomys_reads/SRR960036_2.fastq...
Assembly high frequency kmers ...
Running command: /usr/local/bin/velveth ./FukomysTE/Asm_30_46675_29 29 -fastq -short ./FukomysTE/kmers_fq.fastq...
Running command: /usr/local/bin/velvetg ./FukomysTE/Asm_30_46675_29 -min_contig_lgth 100 -unused_reads yes...
30mer frequency are:2333 and 116685
Assembly high frequency kmers ...
Running command: /usr/local/bin/velveth ./FukomysTE/Asm_30_23337_29 29 -fastq -short ./FukomysTE/kmers_fq.fastq...
Running command: /usr/local/bin/velvetg ./FukomysTE/Asm_30_23337_29 -min_contig_lgth 100 -unused_reads yes...
30mer frequency are:1166 and 58340
Assembly high frequency kmers ...
Running command: /usr/local/bin/velveth ./FukomysTE/Asm_30_11668_29 29 -fastq -short ./FukomysTE/kmers_fq.fastq...
Running command: /usr/local/bin/velvetg ./FukomysTE/Asm_30_11668_29 -min_contig_lgth 100 -unused_reads yes...
30mer frequency are:583 and 29170
Assembly high frequency kmers ...
Running command: /usr/local/bin/velveth ./FukomysTE/Asm_30_5834_29 29 -fastq -short ./FukomysTE/kmers_fq.fastq...
Running command: /usr/local/bin/velvetg ./FukomysTE/Asm_30_5834_29 -min_contig_lgth 100 -unused_reads yes...
30mer frequency are:291 and 14585
Assembly high frequency kmers ...
Running command: /usr/local/bin/velveth ./FukomysTE/Asm_30_2917_29 29 -fastq -short ./FukomysTE/kmers_fq.fastq...
Running command: /usr/local/bin/velvetg ./FukomysTE/Asm_30_2917_29 -min_contig_lgth 100 -unused_reads yes...
30mer frequency are:180 and 7290
Assembly high frequency kmers ...
Running command: /usr/local/bin/velveth ./FukomysTE/Asm_30_1458_29 29 -fastq -short ./FukomysTE/kmers_fq.fastq...
Running command: /usr/local/bin/velvetg ./FukomysTE/Asm_30_1458_29 -min_contig_lgth 100 -unused_reads yes...
30mer frequency are:180 and 3645
Assembly high frequency kmers ...
Running command: /usr/local/bin/velveth ./FukomysTE/Asm_30_729_29 29 -fastq -short ./FukomysTE/kmers_fq.fastq...
Running command: /usr/local/bin/velvetg ./FukomysTE/Asm_30_729_29 -min_contig_lgth 100 -unused_reads yes...
30mer frequency are:180 and 1820
Assembly high frequency kmers ...
Running command: /usr/local/bin/velveth ./FukomysTE/Asm_30_364_29 29 -fastq -short ./FukomysTE/kmers_fq.fastq...
Running command: /usr/local/bin/velvetg ./FukomysTE/Asm_30_364_29 -min_contig_lgth 100 -unused_reads yes...
30mer frequency are:180 and 910
Assembly high frequency kmers ...
Running command: /usr/local/bin/velveth ./FukomysTE/Asm_30_182_29 29 -fastq -short ./FukomysTE/kmers_fq.fastq...
Running command: /usr/local/bin/velvetg ./FukomysTE/Asm_30_182_29 -min_contig_lgth 100 -unused_reads yes...
Working on 40mer now...
Counting kmers...
Running command: /usr/local/bin/jellyfish count -s 500M --bf-size 30M -C -m 40 -t 15 -o ./FukomysTE/mer_counts.jf -F 2 /TowerData/home/colleen/Fukomys_reads/SRR960036_1.fastq /TowerData/home/colleen/Fukomys_reads/SRR960036_2.fastq...
Assembly high frequency kmers ...
Running command: /usr/local/bin/velveth ./FukomysTE/Asm_30_46675_29 29 -fastq -short ./FukomysTE/kmers_fq.fastq...
Running command: /usr/local/bin/velvetg ./FukomysTE/Asm_30_46675_29 -min_contig_lgth 100 -unused_reads yes...
30mer frequency are:2333 and 116685
Assembly high frequency kmers ...
Running command: /usr/local/bin/velveth ./FukomysTE/Asm_30_23337_29 29 -fastq -short ./FukomysTE/kmers_fq.fastq...
Running command: /usr/local/bin/velvetg ./FukomysTE/Asm_30_23337_29 -min_contig_lgth 100 -unused_reads yes...
30mer frequency are:1166 and 58340
Assembly high frequency kmers ...
Running command: /usr/local/bin/velveth ./FukomysTE/Asm_30_11668_29 29 -fastq -short ./FukomysTE/kmers_fq.fastq...
Running command: /usr/local/bin/velvetg ./FukomysTE/Asm_30_11668_29 -min_contig_lgth 100 -unused_reads yes...
30mer frequency are:583 and 29170
Assembly high frequency kmers ...
Running command: /usr/local/bin/velveth ./FukomysTE/Asm_30_5834_29 29 -fastq -short ./FukomysTE/kmers_fq.fastq...
Running command: /usr/local/bin/velvetg ./FukomysTE/Asm_30_5834_29 -min_contig_lgth 100 -unused_reads yes...
30mer frequency are:291 and 14585
Assembly high frequency kmers ...
Running command: /usr/local/bin/velveth ./FukomysTE/Asm_30_2917_29 29 -fastq -short ./FukomysTE/kmers_fq.fastq...
Running command: /usr/local/bin/velvetg ./FukomysTE/Asm_30_2917_29 -min_contig_lgth 100 -unused_reads yes...
30mer frequency are:180 and 7290
Assembly high frequency kmers ...
Running command: /usr/local/bin/velveth ./FukomysTE/Asm_30_1458_29 29 -fastq -short ./FukomysTE/kmers_fq.fastq...
Running command: /usr/local/bin/velvetg ./FukomysTE/Asm_30_1458_29 -min_contig_lgth 100 -unused_reads yes...
30mer frequency are:180 and 3645
Assembly high frequency kmers ...
Running command: /usr/local/bin/velveth ./FukomysTE/Asm_30_729_29 29 -fastq -short ./FukomysTE/kmers_fq.fastq...
Running command: /usr/local/bin/velvetg ./FukomysTE/Asm_30_729_29 -min_contig_lgth 100 -unused_reads yes...
30mer frequency are:180 and 1820
Assembly high frequency kmers ...
Running command: /usr/local/bin/velveth ./FukomysTE/Asm_30_364_29 29 -fastq -short ./FukomysTE/kmers_fq.fastq...
Running command: /usr/local/bin/velvetg ./FukomysTE/Asm_30_364_29 -min_contig_lgth 100 -unused_reads yes...
30mer frequency are:180 and 910
Assembly high frequency kmers ...
Running command: /usr/local/bin/velveth ./FukomysTE/Asm_30_182_29 29 -fastq -short ./FukomysTE/kmers_fq.fastq...
Running command: /usr/local/bin/velvetg ./FukomysTE/Asm_30_182_29 -min_contig_lgth 100 -unused_reads yes...
Working on 40mer now...
Counting kmers...
Running command: /usr/local/bin/jellyfish count -s 500M --bf-size 30M -C -m 40 -t 15 -o ./FukomysTE/mer_counts.jf -F 2 /TowerData/home/colleen/Fukomys_reads/SRR960036_1.fastq /TowerData/home/colleen/Fukomys_reads/SRR960036_2.fastq...

simoncchu commented 7 years ago

Would you please show me your configuration file?

Mobean commented 7 years ago

I left most settings as default.


MIN_REPEAT_FREQ 10
RANGE_ASM_FREQ_DEC 2
RANGE_ASM_FREQ_GAP 0.8
K_MIN 30
K_MAX 50
K_INC 10
K_DFT 30
READ_LENGTH 150
GENOME_LENGTH 2618204639
MIN_CONTIG_LENGTH 100
ASM_NODE_LENGTH_OFFSET -1
IS_DUPLICATE_REPEATS 0.85
COV_DIFF_CUTOFF 0.5
MIN_SUPPORT_PAIRS 20
MIN_FULLY_MAP_RATIO 0.2
TR_SIMILARITY 0.85
TREADS 15
BWA_PATH /usr/local/bin/
SAMTOOLS_PATH /usr/local/bin/
JELLYFISH_PATH /usr/local/bin/
VELVET_PATH /usr/local/bin/
REFINER_PATH ./TERefiner_1
CONTIGS_MERGER_PATH ./ContigsMerger
OUTPUT_FOLDER ./FukomysTE
VERBOSE 1
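(Editor's note: the config above is plain "KEY VALUE" pairs, one per whitespace-separated line. The tiny parser below is a hypothetical illustration of that format, not REPdenovo code.)

```python
# Hypothetical parser for a REPdenovo-style config file:
# one "KEY VALUE" pair per line, split on the first run of whitespace.
def parse_config(text):
    cfg = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            cfg[parts[0]] = parts[1].strip()
    return cfg

cfg = parse_config("K_MIN 30\nK_MAX 50\nBWA_PATH /usr/local/bin/\n")
print(cfg["K_MIN"], cfg["BWA_PATH"])
```

Note the `*_PATH` entries: a bare directory in `BWA_PATH`/`SAMTOOLS_PATH` is what causes the "Permission denied" errors later in this thread.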

simoncchu commented 7 years ago

@Mobean That is really weird to me. I just tested the tool again and it works well; I am not sure how this looping happens. Would it be possible for you to downsample the reads to a smaller size (say 10G or smaller for all paired-end reads) and test? That will make it quick to test the tool. For downsampling, you can just run seqtk ( https://github.com/lh3/seqtk ) like this:

seqtk sample read1.fa.gz 0.2 > sub1.fa
seqtk sample read2.fa.gz 0.2 > sub2.fa
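One caveat worth flagging on the seqtk suggestion: when sampling the two files of a pair separately, both `seqtk sample` invocations must use the same random seed (`-s`; seqtk's default seed is 11, which is why the two plain commands above stay in sync) or the subsampled files fall out of pairing. The principle, sketched with a hypothetical sampler in plain Python:

```python
# Why paired-end downsampling needs the same seed for both files: the
# sampler makes one accept/reject decision per record, so identical seeds
# keep mates in step (this mirrors running `seqtk sample -s100` on both
# read1 and read2).
import random

def sample_record_ids(ids, fraction, seed):
    rng = random.Random(seed)
    return [i for i in ids if rng.random() < fraction]

ids = ["read_%d" % i for i in range(1000)]
left = sample_record_ids(ids, 0.2, seed=100)   # _1 file
right = sample_record_ids(ids, 0.2, seed=100)  # _2 file, same seed
assert left == right  # mates remain paired
print(len(left), "pairs kept")
```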

Mobean commented 7 years ago

Update: I downsampled and now it seems to be working fine with the smaller files. It made it through the 40mers and has moved on to 50mers. If all goes well, is there anything I should do to get the larger dataset to work?

simoncchu commented 7 years ago

I think it also works for the larger dataset; I have tried it on a much larger genome before. Also, higher coverage does not necessarily produce better results. You can provide, say, 10X coverage data as input and run the tool; the results do not differ much from using much higher coverage.

Mobean commented 7 years ago

Ok, it made it through the assembly and then everything crashed.

Running command: /usr/local/bin/ faidx ./FukomysTE/contigs.fa ...
/bin/sh: 1: /usr/local/bin/: Permission denied
/bin/sh: 1: /usr/local/bin/: Permission denied
/bin/sh: 1: /usr/local/bin/: Permission denied
/bin/sh: 1: /usr/local/bin/: Permission denied
/bin/sh: 1: /usr/local/bin/: Permission denied
/bin/sh: 1: /usr/local/bin/: Permission denied
/bin/sh: 1: /usr/local/bin/: Permission denied
Running command: ./TERefiner_1 -P -b ./FukomysTE/contigs.fa.itself.sort.bam -r ./FukomysTE/contigs.fa -o ./FukomysTE/contigs.fa_no_dup.fa -c 0.9 -g ...
Index file not found, now create it!!!
Index file cannot be created!!!
Bamtools ERROR: could not open input BAM file: ./FukomysTE/contigs.fa.itself.sort.bam
rm: cannot remove ‘./FukomysTE/contigs.fa.sa’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa.pac’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa.bwt’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa.ann’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa.amb’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa.itself.bam’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa.itself.bam’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa.itself.bam’: No such file or directory
Running command: ./ContigsMerger -s 0.2 -i1 -6.0 -i2 -6.0 -x 15 -y 50 -k 10 -t 15 -m 1 -o ./FukomysTE/contigs.fa_no_dup.fa.merge.info ./FukomysTE/contigs.fa_no_dup.fa > ./FukomysTE/contigs.fa_no_dup.fa.merged.fa ...
Running command: /usr/local/bin/ faidx ./FukomysTE/contigs.fa_no_dup.fa.merged.fa ...
/bin/sh: 1: /usr/local/bin/: Permission denied
/bin/sh: 1: /usr/local/bin/: Permission denied
/bin/sh: 1: /usr/local/bin/: Permission denied
/bin/sh: 1: /usr/local/bin/: Permission denied
/bin/sh: 1: /usr/local/bin/: Permission denied
/bin/sh: 1: /usr/local/bin/: Permission denied
/bin/sh: 1: /usr/local/bin/: Permission denied
Running command: ./TERefiner_1 -P -b ./FukomysTE/contigs.fa_no_dup.fa.merged.fa.itself.sort.bam -r ./FukomysTE/contigs.fa_no_dup.fa.merged.fa -o ./FukomysTE/contigs.fa_no_dup.fa.merged.fa.no_dup.fa -c 0.85 -g ...
Index file not found, now create it!!!
Index file cannot be created!!!
Bamtools ERROR: could not open input BAM file: ./FukomysTE/contigs.fa_no_dup.fa.merged.fa.itself.sort.bam
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.sa’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.pac’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.bwt’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.ann’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.amb’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.itself.bam’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.itself.bam’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.itself.bam’: No such file or directory
Running command: /usr/local/bin/ faidx ./FukomysTE/contigs.fa_no_dup.fa.merged.fa.no_dup.fa ...
/bin/sh: 1: /usr/local/bin/: Permission denied
/bin/sh: 1: /usr/local/bin/: Permission denied
/bin/sh: 1: /usr/local/bin/: Permission denied
/bin/sh: 1: /usr/local/bin/: Permission denied
/bin/sh: 1: /usr/local/bin/: Permission denied
/bin/sh: 1: /usr/local/bin/: Permission denied
/bin/sh: 1: /usr/local/bin/: Permission denied
Running command: ./TERefiner_1 -P -b ./FukomysTE/contigs.fa_no_dup.fa.merged.fa.no_dup.fa.itself.sort.bam -r ./FukomysTE/contigs.fa_no_dup.fa.merged.fa.no_dup.fa -o ./FukomysTE/contigs.fa_no_dup.fa.merged.fa.no_dup.fa.no_contained.fa -c 0.85 ...
Index file not found, now create it!!!
Index file cannot be created!!!
Bamtools ERROR: could not open input BAM file: ./FukomysTE/contigs.fa_no_dup.fa.merged.fa.no_dup.fa.itself.sort.bam
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.no_dup.fa.sa’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.no_dup.fa.pac’: No such file or directory
Index file cannot be created!!!
Bamtools ERROR: could not open input BAM file: ./FukomysTE/contigs.fa_no_dup.fa.merged.fa.itself.sort.bam
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.sa’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.pac’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.bwt’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.ann’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.amb’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.itself.bam’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.itself.bam’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.itself.bam’: No such file or directory
Running command: /usr/local/bin/ faidx ./FukomysTE/contigs.fa_no_dup.fa.merged.fa.no_dup.fa ...
/bin/sh: 1: /usr/local/bin/: Permission denied
/bin/sh: 1: /usr/local/bin/: Permission denied
/bin/sh: 1: /usr/local/bin/: Permission denied
/bin/sh: 1: /usr/local/bin/: Permission denied
/bin/sh: 1: /usr/local/bin/: Permission denied
/bin/sh: 1: /usr/local/bin/: Permission denied
/bin/sh: 1: /usr/local/bin/: Permission denied
Running command: ./TERefiner_1 -P -b ./FukomysTE/contigs.fa_no_dup.fa.merged.fa.no_dup.fa.itself.sort.bam -r ./FukomysTE/contigs.fa_no_dup.fa.merged.fa.no_dup.fa -o ./FukomysTE/contigs.fa_no_dup.fa.merged.fa.no_dup.fa.no_contained.fa -c 0.85 ...
Index file not found, now create it!!!
Index file cannot be created!!!
Bamtools ERROR: could not open input BAM file: ./FukomysTE/contigs.fa_no_dup.fa.merged.fa.no_dup.fa.itself.sort.bam
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.no_dup.fa.sa’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.no_dup.fa.pac’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.no_dup.fa.bwt’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.no_dup.fa.ann’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.no_dup.fa.amb’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.no_dup.fa.itself.bam’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.no_dup.fa.itself.bam’: No such file or directory
rm: cannot remove ‘./FukomysTE/contigs.fa_no_dup.fa.merged.fa.no_dup.fa.itself.bam’: No such file or directory
Traceback (most recent call last):
  File "./main.py", line 408, in <module>
    main_func(scommand,sfconfig,sfreads_list)
  File "./main.py", line 371, in main_func
    RM_DUP_BF_MERGE_CUTOFF, RM_DUP_AF_MERGE_CUTOFF)
  File "/TowerData/home/colleen/REPdenovo/MergeContigs.py", line 91, in merge_contigs
    os.rename(foutput,fout_folder+"contigs.fa")
OSError: [Errno 2] No such file or directory

simoncchu commented 7 years ago

@Mobean please change the BWA_PATH and SAMTOOLS_PATH to:

BWA_PATH /usr/local/bin/bwa
SAMTOOLS_PATH /usr/local/bin/samtools

Mobean commented 7 years ago

Thanks! I made the changes and re-ran. How can I determine whether it ran correctly? It made it through TERefiner_1 twice but just stopped at the third. I tried running the Scaffolding step and got this:

python ./main.py -c Scaffolding -g /TowerData/home/colleen/Fukomys_reads/configFileFUK.txt -r TowerData/home/colleen/Fukomys_reads/SUBfukomysPairedEndRead.txt
Traceback (most recent call last):
  File "./main.py", line 408, in <module>
    main_func(scommand,sfconfig,sfreads_list)
  File "./main.py", line 340, in main_func
    assert os.path.exists(sfreads_list),"raw reads list file is not found"
AssertionError: raw reads list file is not found

Mobean commented 7 years ago

Hello @Reedwarbler,

I re-ran the scaffolding and it appears to have worked. There are contigs in the contigs.fa file, but there doesn't appear to be a file with the contig pair info (the 0_contig_pairs_info.txt_temp_ContigsConnections.txt file is empty). I am re-running to see if something may have happened during the run.

What would you suggest for the input data? I have multiple paired-end runs with different insert sizes; my first try used just a single paired-end run. I saw in a previous post that you said if multiple runs are included, the program only uses the first run to assemble. Would you recommend any insert size over another?

Thanks!

simoncchu commented 7 years ago

Hi @Mobean

I would suggest using paired-end reads with low coverage (say ~10x). The insert size should not matter much.

Mobean commented 7 years ago

@Reedwarbler I re-ran the Scaffolding and no X_contig_pairs_info.txt_cov_info_with_cutoff.txt file was produced. Some other files were produced, but they were empty:
0_contig_pairs_info.txt_discarded.txt
0_contig_pairs_info.txt_merged.txt
0_contig_pairs_info.txt_temp_ContigsConnections.txt

Any suggestion on how to get this to produce the repeat coverage information?

I opened the contigs.fa file and ran the first 10 contigs through RepeatMasker to see whether they were repetitive DNA. Repetitive DNA was identified in only 3 of the first 10.

simoncchu commented 7 years ago

@Mobean I reproduced the same problem. I will check the code and get back to you.

simoncchu commented 7 years ago

Hi @Mobean, I have fixed the bug.

To update, you can either take the updated "refiner.cpp", "Coverage.cpp" and "bam_parse.cpp" under the TERefiner folder, compile them, and copy the generated "TERefiner_1" into the same folder as "main.py"; or you can directly use the already-compiled "TERefiner_1" under the TERefiner folder (you need to run "chmod +x TERefiner_1" first).

Please let me know whether it works. Thank you.

jsal873 commented 7 years ago

Hello @Reedwarbler I tried running the Assembly part, but it stops after a couple of minutes, always after running through TERefiner_1. I have attached the command line output here: run.log.txt. I suspect that the TERefiner_1 binary is failing to run. I have tried both the binary in the main REPdenovo directory and the one in REPdenovo/TERefiner, but I get the same error with both. I don't think this error is related to permissions, as I have also tried changing the permissions of the binary file. I have also tried to compile TERefiner_1 from source, but I haven't been successful; I get the following output:

g++ -O3 -Wall -static -I../bamtools-master/include -L../bamtools-master/lib -Wl,-rpath,../bamtools-master/lib -c bam_parse.cpp
In file included from bam_parse.cpp:1:0:
bam_parse.h:4:27: fatal error: api/BamReader.h: No such file or directory
compilation terminated.
Makefile:19: recipe for target 'bam_parse.o' failed
make: *** [bam_parse.o] Error 1

The compilation error appears to be related to the Bamtools libraries not being found. I believe I have properly installed Bamtools; however, the Bamtools directory on my computer is called /bamtools/ and not /bamtools-master/. So the error may be related to a hard-coded Bamtools path somewhere in the build files.

In your previous comment you mention to update "refiner.cpp", "Coverage.cpp" and "bam_parse.cpp". What needs to be updated on these files?

Config file: test_config.txt

data input: test_sample_input.txt

Thank you in advance for your help!

I'm really looking forward to properly running your cool program!

simoncchu commented 7 years ago

@Mobean

There is a "Makefile" under folder REPdenovo/TERefiner; you need to change its first two lines to the paths on your server. Then re-compile TERefiner, and copy the compiled "TERefiner_1" to replace the one under the "REPdenovo" folder.

I am sorry for these tedious steps. I will release a much more user-friendly version soon.

jsal873 commented 7 years ago

Hello again @Reedwarbler, Thank you for your lightning fast response! I'm impressed! Could you please also provide me with the command line output of a successful run?

simoncchu commented 7 years ago

@Mobean I usually check the output files to see whether the run was successful. For the "Assembly" option, when "contigs.fa" is constructed and some contigs have ids like ">NEW_CONTIG_MERGE_xx", this step should have been successful.

For the "Scaffolding" option, two files, "X_contig_pairs_info.txt_cov_info_with_cutoff.txt" and "X_contig_pairs_info.txt_after_filter_cov_info.txt", will be constructed. Both files can be read as the copy number of each repeat. The difference is this: after aligning the reads back to the contigs, some reads have low mapping quality (mainly because they can be mapped to several contigs). The "X_contig_pairs_info.txt_cov_info_with_cutoff.txt" file is generated from all the reads, while "X_contig_pairs_info.txt_after_filter_cov_info.txt" is calculated after filtering out reads with mapping quality lower than 30.

For both files, the first three columns are: contig_id, contig_length, contig_average_depth. Also, the last field in "X_contig_pairs_info.txt_after_filter_cov_info.txt" is the "hited_length"; however, I noticed I forgot to initialize that value, so it accumulates across contigs. I will fix this in the next release.
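The description above can be turned into a small reader. This is a sketch assuming only the column layout just described (whitespace-separated, first three columns contig_id, contig_length, contig_average_depth), with copy number approximated as contig depth over genome-wide sequencing coverage; the sample rows are illustrative:

```python
# Parse the first three columns of a *_cov_info_* file (layout as described
# above) and estimate repeat copy number as contig depth / sequencing
# coverage. Not REPdenovo code; adjust to your file's actual contents.
def repeat_copy_numbers(lines, seq_coverage):
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        contig_id, length, depth = fields[0], int(fields[1]), float(fields[2])
        rows.append((contig_id, length, depth / seq_coverage))
    return rows

sample = ["contig_1 1500 180.0", "contig_2 300 18.0"]
for contig_id, length, copies in repeat_copy_numbers(sample, seq_coverage=18.0):
    print(contig_id, length, round(copies, 1))  # ~10.0 and ~1.0 copies
```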

simoncchu commented 7 years ago

@jsal873 I am sorry, I thought you were @Mobean and @'d the wrong person. Apologies to both @Mobean and @jsal873.

jsal873 commented 7 years ago

@Reedwarbler hahaha no worries, thank you for your quick responses, you are awesome! I hope you don't mind if I ask some additional questions: Is the contigs.fa file generated gradually? Could it be generated if the program crashes mid-run, or is it generated only if the whole program runs successfully?

Does the parameter "Genome_length" refer to the actual genome size? What if I have low coverage data (i.e. <1X)? should I use the estimated genome length from my coverage?

For the parameter "READ_LENGTH", what If I have variable read lengths in my data? should I use the max read length? min read length? mean read length?

Again, Thank you very much!

simoncchu commented 7 years ago

@jsal873

  1. Is the contigs.fa file generated gradually? Could it be generated if the program crashes mid-run, or is it generated only if the whole program runs successfully? Yes, it is generated gradually. The first main step assembles contigs directly from k-mers; then the "Merging" step adds more merged ones.

  2. Does the parameter "Genome_length" refer to the actual genome size? Yes

  3. What if I have low coverage data (i.e. <1X)? Should I use the estimated genome length from my coverage? The genome length is only used to calculate the coverage; if you have set the "READ_DEPTH" (coverage) parameter, the genome length is not used.

  4. For the parameter "READ_LENGTH", what if I have variable read lengths in my data? Should I use the max, min, or mean read length? The program assumes all reads have the same length, which is true for most datasets, except for cases like mate-pair reads that need adapter trimming. If you really need to use reads of varied length, I would suggest using the maximum length.
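A quick way to act on point 4 is to tally the read lengths directly from the FASTQ before filling in READ_LENGTH; the sketch below uses inline records in place of a real file:

```python
# Tally read lengths in a FASTQ (4 lines per record; the sequence is the
# 2nd line of each record) to pick a READ_LENGTH value -- per the advice
# above, the maximum observed length.
from collections import Counter

def read_length_counts(fastq_lines):
    return Counter(len(s.strip()) for s in fastq_lines[1::4])

fastq = ["@r1", "ACGTACGTAC", "+", "IIIIIIIIII",
         "@r2", "ACGTACG",    "+", "IIIIIII"]
counts = read_length_counts(fastq)
print(dict(counts))  # length -> number of reads
print(max(counts))   # candidate READ_LENGTH
```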

Mobean commented 7 years ago

@Reedwarbler @jsal873 Thanks for including me in the thread. I think we are working on some of the same issues, so this is really helpful. I updated the 3 files and re-compiled as you suggested. I am currently re-running, so I will let you know as soon as it finishes.

I am a bit concerned that it is stalling on something. On the log screen, I noticed it progressed through TERefiner_1 really quickly and then moved on to ContigsMerger. When I first checked with htop, ContigsMerger was running on multiple cores, but the activity has since dropped substantially to a single core, and it sometimes appears completely inactive. None of the files have updated for over 30 minutes. Any idea what could be going on? Thanks!

simoncchu commented 7 years ago

@Mobean The merging step is the most time-consuming step of the whole pipeline, mainly because it needs to check pairwise overlaps between contigs and traverse the graph. If the genome is quite repetitive it may take a long time (I once assembled a quite repetitive, and large, genome, and it took several days to finish). I have parallelized some steps to speed things up, but some steps are hard to parallelize; I may re-design the merging step later. For now, I would suggest setting "MIN_REPEAT_FREQ" to a larger value (10 by default) to speed things up.