ksahlin / IsoCon

Derives consensus sequences from a set of long noisy reads by clustering and error correction.
GNU General Public License v3.0

Can we use IsoCon for an organism with a large polyploid genome? / PBS: job killed: mem 127935680kb exceeded limit 125829120kb #4

Open uqvirg opened 6 years ago

uqvirg commented 6 years ago

Hi, I tried to use IsoCon for the sugarcane transcriptome (10 Gb genome, highly polyploid, 100-130 chromosomes, aneuploid with varying ploidy levels).

After 7 days with 24 CPUs, I got this error: the allocated memory (120 GB) was exceeded.

########################### Execution Started #############################
/var/spool/pbs/mom_priv/jobs/100062.tinmgmr1.SC: line 17:  : command not found
/var/spool/pbs/mom_priv/jobs/100062.tinmgmr1.SC: line 20:  : command not found
=>> PBS: job killed: mem 127935680kb exceeded limit 125829120kb
########################### Job Execution History #############################
JobName:IsoConSugar
SessionId:24450
ResourcesRequested:mem=120gb,ncpus=24,place=free,walltime=326:00:00
ResourcesUsed:cpupercent=2400,cput=1596:37:06,mem=127935680kb,ncpus=24,vmem=211531212kb,walltime=136:17:10
QueueUsed:Long
ExitStatus:271
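To put the kill message in the same units as the request, here is a minimal Python sketch that pulls the two numbers out of the log line and converts them. It assumes PBS "kb" and "gb" are binary units (1 gb = 1024 * 1024 kb), which is consistent with the limit of 125829120kb being exactly the requested 120gb. The function names are made up for this illustration.

```python
import re

# The kill message exactly as it appears in the job log above.
KILL_LINE = "=>> PBS: job killed: mem 127935680kb exceeded limit 125829120kb"


def parse_pbs_mem_kill(line):
    """Return (used_kb, limit_kb) from a PBS 'job killed: mem' line, or None."""
    match = re.search(r"mem (\d+)kb exceeded limit (\d+)kb", line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def kb_to_gb(kb):
    """Convert PBS 'kb' to 'gb', assuming binary units (assumption, see text)."""
    return kb / 1024**2


used_kb, limit_kb = parse_pbs_mem_kill(KILL_LINE)
print(f"limit: {kb_to_gb(limit_kb):.2f} gb")            # limit: 120.00 gb
print(f"used:  {kb_to_gb(used_kb):.2f} gb")             # used:  122.01 gb
print(f"over:  {kb_to_gb(used_kb - limit_kb):.2f} gb")  # over:  2.01 gb
```

So the job was killed roughly 2 gb over the limit; the log alone does not say how much more memory the run would have needed.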

The files generated are:

4096 Apr 3 18:31 alignments ## empty directory
78263932 Apr 6 15:08 candidates_step_10.fa
78310448 Apr 6 20:41 candidates_step_11.fa
78330217 Apr 7 02:16 candidates_step_12.fa
78345513 Apr 7 07:50 candidates_step_13.fa
78355500 Apr 7 13:27 candidates_step_14.fa
78359544 Apr 7 19:00 candidates_step_15.fa
78364148 Apr 8 00:35 candidates_step_16.fa
78365950 Apr 8 06:10 candidates_step_17.fa
78366454 Apr 8 11:43 candidates_step_18.fa
78366454 Apr 8 17:17 candidates_step_19.fa
78366454 Apr 8 22:53 candidates_step_20.fa
78366454 Apr 9 04:27 candidates_step_21.fa
78366454 Apr 9 10:01 candidates_step_22.fa
77594624 Apr 9 10:01 candidates_step_23.fa
62809976 Apr 4 16:32 candidates_step_2.fa
67331407 Apr 4 23:02 candidates_step_3.fa
72215051 Apr 5 05:12 candidates_step_4.fa
75426509 Apr 5 11:05 candidates_step_5.fa
77103366 Apr 5 16:46 candidates_step_6.fa
77782120 Apr 5 22:22 candidates_step_7.fa
78103952 Apr 6 03:59 candidates_step_8.fa
78212789 Apr 6 09:35 candidates_step_9.fa
0 Apr 3 18:31 filtered_reads.fa
0 Apr 3 18:31 logfile.txt

So, can we use IsoCon for organisms with large polyploid genomes? If so, is there a way, or a program we can use, to finish the process without having to re-run everything and regenerate, in this case, these 23 .fa files?

Thank you, Virgg

ksahlin commented 6 years ago

Hi Virgg,

IsoCon is designed for targeted Iso-Seq data, which is why you see the long runtime and large memory consumption on a nontargeted dataset; for more details on why, see issue 3. However, IsoCon could work (both better and much faster, with less memory) if this dataset were batched into subsets of similar CCS reads (e.g., roughly similar lengths and sequences); see issue 2. This should be "fairly straightforward" with either CCS read-to-read alignment (minimap2, all-vs-all on the CCS reads) or alignment to a reference.
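As an illustration of the read-to-read batching idea, here is a minimal Python sketch that groups reads into batches from all-vs-all alignments in PAF format (the tab-separated format minimap2 writes). It is not part of IsoCon. Reads are linked when an alignment passes two thresholds, and batches are the connected components of the resulting graph. The function name and both thresholds (`min_identity`, `min_coverage`) are hypothetical and would need tuning on real data.

```python
from collections import defaultdict


def cluster_reads_from_paf(paf_lines, min_identity=0.9, min_coverage=0.8):
    """Return batches of read names: connected components of the overlap graph.

    PAF columns used (0-based): 0 query name, 1 query length, 2/3 query
    start/end, 5 target name, 6 target length, 7/8 target start/end,
    9 number of matching bases, 10 alignment block length.
    """
    parent = {}

    def find(read):
        # Union-find lookup with path halving.
        parent.setdefault(read, read)
        while parent[read] != read:
            parent[read] = parent[parent[read]]
            read = parent[read]
        return read

    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        qname, qlen, qstart, qend = fields[0], int(fields[1]), int(fields[2]), int(fields[3])
        tname, tlen, tstart, tend = fields[5], int(fields[6]), int(fields[7]), int(fields[8])
        nmatch, alnlen = int(fields[9]), int(fields[10])
        # Register both reads so that unlinked reads still form singleton batches.
        root_q, root_t = find(qname), find(tname)
        if qname == tname or alnlen == 0:
            continue
        identity = nmatch / alnlen
        coverage = min((qend - qstart) / qlen, (tend - tstart) / tlen)
        if identity >= min_identity and coverage >= min_coverage:
            parent[root_q] = root_t  # merge the two batches

    batches = defaultdict(list)
    for read in parent:
        batches[find(read)].append(read)
    return sorted((sorted(b) for b in batches.values()), key=lambda b: (-len(b), b))


# Tiny made-up example: r1, r2, r3 overlap end to end; r4 only shares a short piece with r1.
example_paf = [
    "r1\t1000\t0\t1000\t+\tr2\t1010\t5\t1005\t980\t1000\t0",
    "r2\t1010\t0\t1000\t+\tr3\t1000\t0\t1000\t950\t1000\t0",
    "r4\t2000\t0\t300\t+\tr1\t1000\t0\t300\t290\t300\t0",
]
print(cluster_reads_from_paf(example_paf))  # [['r1', 'r2', 'r3'], ['r4']]
```

Reads that never appear in the PAF input (no reported overlap at all) are not in the output and would have to be handled separately, for example as singletons.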

Given that you have reached 23 iterations in the correction phase, the sequences in candidates_step_23.fa should be close to the end of the correction phase. This means that these sequences will likely be of higher quality than the original CCS reads. However, due to the crash, IsoCon did not start the second phase, where it statistically validates the sequences (and removes non-significant ones), so there will be redundancy. You can try to start with the statistical step (see here) with -candidates candidates_step_23.fa, but there will be a long runtime here as well (setting --ignore_ends 0 might speed it up, but with a quality tradeoff). I want to stress that IsoCon has not been tested on nontargeted data. So preprocessing the CCS reads into batches and starting IsoCon from the beginning on each batch separately is my best advice for making IsoCon suitable for calling and phasing highly similar transcripts in polyploid genomes.

I might also add that we are working on a nontargeted version of IsoCon that will use minimap2 for alignments, which will reduce both runtime and memory usage.

Best, Kristoffer

ksahlin commented 6 years ago

minor correction: Iteration 22 is the last correct one as the file in iteration 23 seems to be truncated (looking at filesize). So statistical test with -candidates candidates_step_22.fa would be the better option if you decide to try that.
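The file-size check can be written down as a small Python sketch. It rests on a heuristic taken from this thread, not on IsoCon's own logic: during the correction phase the candidate file grew and then stayed constant, so a sudden drop suggests a write that was cut off when the job was killed. The heuristic does not apply to the statistical phase, where candidates are removed and files are expected to shrink. The function name is made up for this illustration.

```python
def last_intact_step(sizes_by_step):
    """Return the last iteration before the file size first drops.

    `sizes_by_step` maps iteration number -> file size in bytes.
    Heuristic only: assumes sizes do not shrink between iterations.
    """
    if not sizes_by_step:
        raise ValueError("no checkpoint sizes given")
    steps = sorted(sizes_by_step)
    for prev, cur in zip(steps, steps[1:]):
        if sizes_by_step[cur] < sizes_by_step[prev]:
            return prev
    return steps[-1]


# Sizes of the last candidates_step_X.fa files from the listing above.
sizes = {
    18: 78366454,
    19: 78366454,
    20: 78366454,
    21: 78366454,
    22: 78366454,
    23: 77594624,
}
print(last_intact_step(sizes))  # 22
```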

uqvirg commented 6 years ago

Thanks for your answers. I'll let you know.

uqvirg commented 6 years ago

Hi Kristoffer,

I ran

IsoCon pipeline -fl_reads isoseq_flnc.fasta -candidates /isoconOutput/candidates_step_22.fa -outfolder /isoconOutput2 --ccs reads_of_insert.fastq --ignore_ends 0

and got this error:

usage: Pipeline for obtaining non-redundant haplotype specific transcript isoforms using PacBio IsoSeq reads. [-h] [--version] {pipeline,get_candidates,stat_filter} ...
Pipeline for obtaining non-redundant haplotype specific transcript isoforms using PacBio IsoSeq reads.: error: unrecognized arguments: -candidates candidates_step_22.fa

So I tried

IsoCon stat_filter -fl_reads isoseq_flnc.fasta -candidates /isoconOutput/candidates_step_22.fa -outfolder /isoconOutput2 --ccs reads_of_insert.fastq --ignore_ends 0

and got a truncated file error:

[W::sam_read1] Parse error at line 2
Traceback (most recent call last):
  File "/IsoCon", line 298, in <module>
    run_stat_filter(params)
  File "/IsoCon", line 126, in run_stat_filter
    isocon_statistical_test.stat_filter_candidates(params.fl_reads, params.candidates, read_partition, to_realign, params)
  File "/IsoCon/modules/isocon_statistical_test.py", line 193, in stat_filter_candidates
    ccs_dict_raw = ccs_info.get_ccs(ccs_file)
  File "/IsoCon/modules/ccs_info.py", line 321, in get_ccs
    for read in ccs_file.fetch(until_eof=True):
  File "pysam/libcalignmentfile.pyx", line 2177, in pysam.libcalignmentfile.IteratorRowAll.next
IOError: truncated file

I didn't investigate much further, but do you have an idea what the problem is?

Thank you, Virgg

ksahlin commented 6 years ago

Hi Virgg,

Yes, it should be stat_filter for running only the statistical test; this is an error in the documentation that I should fix.

Regarding the runtime error in stat_filter: you are giving a fastq file to the --ccs parameter, while this should be a bam file. If you have a fastq file of the CCS reads (with quality values provided by the ccs caller program), you can omit the --ccs parameter and simply run

IsoCon stat_filter -fl_reads isoseq_flnc.FASTQ \
                   -candidates /isoconOutput/candidates_step_22.fa \
                   -outfolder /isoconOutput2 \
                   --ignore_ends 0

Otherwise, with a read fasta file, you have to run

IsoCon stat_filter -fl_reads isoseq_flnc.fasta \
                   -candidates /isoconOutput/candidates_step_22.fa \
                   -outfolder /isoconOutput2 \
                   --ccs reads_of_insert.BAM \
                   --ignore_ends 0
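A quick way to check which kind of file you are about to pass is to look at its first bytes rather than its extension. The Python sketch below is not part of IsoCon; it relies on the fact that BAM files are BGZF-compressed (gzip-compatible) and start with the magic bytes BAM\1 once decompressed. The function name and return labels are made up for this illustration. Note that a plain-text SAM header also starts with "@", so the "fastq" answer is only a first guess.

```python
import gzip


def sniff_read_file(path):
    """Guess whether `path` is BAM, FASTQ or FASTA from its first bytes."""
    with open(path, "rb") as handle:
        head = handle.read(2)
    if head == b"\x1f\x8b":  # gzip/BGZF magic number
        with gzip.open(path, "rb") as handle:
            magic = handle.read(4)
        return "bam" if magic == b"BAM\x01" else "gzip (not BAM)"
    if head[:1] == b"@":
        return "fastq"  # could also be a SAM header, see note above
    if head[:1] == b">":
        return "fasta"
    return "unknown"
```

For example, `sniff_read_file("reads_of_insert.fastq")` would be expected to return "fastq", which signals that the file should not be given to --ccs.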

Best, K

uqvirg commented 6 years ago

Thank you ! It's running, I'll let you know.

uqvirg commented 6 years ago

Hi Kristoffer,

I have run: IsoCon stat_filter -fl_reads isoseq_flnc.fastq -candidates /isoconOutput/candidates_step_22.fa -outfolder /isoconOutput2 --ignore_ends 0

which has generated the files listed below and stopped with the error "IOError: truncated file".

######################

uqvperlo@tinaroo1:.../isoconOutput2> ls -l
total 827072
drwxr-xr-x 2 4096 Apr 10 23:53 alignments
-rw-r--r-- 1 47869024 Apr 12 02:32 candidates_after_step_1.fa
-rw-r--r-- 1 47691657 Apr 12 10:25 candidates_after_step_2.fa
-rw-r--r-- 1 47677256 Apr 12 18:20 candidates_after_step_3.fa
-rw-r--r-- 1 47677256 Apr 13 02:02 candidates_after_step_4.fa
-rw-r--r-- 1 0 Apr 11 07:48 filtered_reads.fa
-rw-r--r-- 1 0 Apr 11 07:48 logfile.txt
-rw-r--r-- 1 52330578 Apr 11 07:48 preprocessed_candidates.fa
-rw-r--r-- 1 360081287 Apr 13 02:02 remaining_to_align.fa
-rw-r--r-- 1 52330578 Apr 11 07:53 temp_candidates_step_1.fa
-rw-r--r-- 1 47869024 Apr 12 02:32 temp_candidates_step_2.fa
-rw-r--r-- 1 47691657 Apr 12 10:25 temp_candidates_step_3.fa
-rw-r--r-- 1 47677256 Apr 12 18:20 temp_candidates_step_4.fa
-rw-r--r-- 1 47677256 Apr 13 02:02 temp_candidates_step_5.fa

##########################################
[W::sam_read1] Parse error at line 2
Traceback (most recent call last):
  File "/IsoCon", line 298, in <module>
    run_stat_filter(params)
  File "/IsoCon", line 126, in run_stat_filter
    isocon_statistical_test.stat_filter_candidates(params.fl_reads, params.candidates, read_partition, to_realign, params)
  File "/IsoCon/modules/isocon_statistical_test.py", line 193, in stat_filter_candidates
    ccs_dict_raw = ccs_info.get_ccs(ccs_file)
  File "/IsoCon/modules/ccs_info.py", line 321, in get_ccs
    for read in ccs_file.fetch(until_eof=True):
  File "pysam/libcalignmentfile.pyx", line 2177, in pysam.libcalignmentfile.IteratorRowAll.next
IOError: truncated file

##########################################

Do you have an idea what could cause this pysam error? Thank you for your help.

Virgg

ksahlin commented 6 years ago

Hi Virgg,

This should not happen. At the very start of the run, IsoCon prints the current parameter settings. Can you confirm that the --ccs parameter is not specified for this run, i.e., that it is set to None? Here is an example of what the first lines look like:

nr_cores: 16
p_value_threshold: 0.01
ignore_ends_len: 15
ccs: None
min_candidate_support: 2
neighbor_search_depth: 4294967296
cleanup: False
is_fastq: False
which: pipeline
fl_reads: /Users/kxs624/Documents/data/pacbio/simulated/ISOseq_sim_n_2000/simulated_pacbio_reads.fa
outfolder: /Users/kxs624/tmp/isocon_n_2000_nt
min_exon_diff: 20
max_phred_q_trusted: 43
min_test_ratio: 5
prefilter_candidates: False
verbose: False

The reason I ask is that the pysam error is generated in a code segment that is entered only if the --ccs flag is specified. The pysam error is because something with the ccs.bam file seems to be formatted incorrectly (something we can investigate at a later point), but according to your parameters there should not be a bamfile.

On another note, you can always resume IsoCon from the last temp_candidates_step_X.fa in case IsoCon does not finish. That is, pass -candidates temp_candidates_step_X.fa. This will save time as IsoCon is deterministic, so you would get the same temp_candidates_step_X.fa regenerated with the same parameter settings. In this case restarting it would be with temp_candidates_step_5.fa -- of course relevant only if everything else was specified correctly.
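Picking the file to resume from can be automated with a few lines of Python. This is a sketch that assumes only the temp_candidates_step_X.fa naming seen in this thread; the function name is made up. It compares step numbers numerically, since a plain alphabetical sort puts step_10 before step_2, and it skips empty files. It cannot tell whether the newest file was cut off mid-write, so it is still worth comparing its size with the previous step.

```python
import re
from pathlib import Path

# File naming taken from the output folders shown in this thread.
STEP_PATTERN = re.compile(r"^temp_candidates_step_(\d+)\.fa$")


def latest_checkpoint(outfolder):
    """Return the non-empty temp_candidates_step_X.fa with the largest X, or None."""
    best_step, best_path = -1, None
    for path in Path(outfolder).iterdir():
        match = STEP_PATTERN.match(path.name)
        if match and path.stat().st_size > 0:
            step = int(match.group(1))
            if step > best_step:
                best_step, best_path = step, path
    return best_path
```

The returned path is what would be passed to -candidates when restarting stat_filter with otherwise identical parameters.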

ksahlin commented 6 years ago

Hi again,

Have you tried PacBio's Iso-Seq ToFU or ToFU2 pipeline for your dataset? I saw that both the ToFU and ToFU2 pipelines contain a separate program, preCluster (preCluster - ToFU, preCluster - ToFU2), that aims to split CCS reads into batches of similar reads. It might be just what is needed for IsoCon to work well with your dataset.

Note: The precluster step in ToFU2 is much more sophisticated than preCluster in ToFU as it separates on sequence similarity and not just on the length of the reads. However, I think any of them will give an improvement.
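To make the length-only variant concrete, here is a crude Python sketch that bins FASTA records by sequence length. It is not the ToFU preCluster implementation, just an illustration of the idea of length-based batching; the function names and the default `bin_width` are hypothetical.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def bin_by_length(records, bin_width=500):
    """Group records into batches keyed by the lower bound of their length bin."""
    bins = defaultdict(list)
    for header, seq in records:
        bins[(len(seq) // bin_width) * bin_width].append((header, seq))
    return dict(bins)


# Tiny made-up example with a bin width of 10 bases.
example = [">a", "ACGTACGT", ">b", "ACGT", "ACGTACGTAC", ">c", "ACG"]
print(bin_by_length(read_fasta(example), bin_width=10))
# {0: [('a', 'ACGTACGT'), ('c', 'ACG')], 10: [('b', 'ACGTACGTACGTAC')]}
```

Reads just either side of a bin boundary end up in different batches, which is one reason a similarity-based split is the more robust option.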

uqvirg commented 6 years ago

Hi Kristoffer,

After running

IsoCon stat_filter -fl_reads isoseq_flnc.fastq -candidates /isoconOutput/candidates_step_22.fa -outfolder /isoconOutput2 --ignore_ends 0

which generated the files listed in my previous comment and stopped with the error "IOError: truncated file" (pysam error) but created the file

47677256 Apr 13 02:02 candidates_afterstep5.fa

I ran

IsoCon stat_filter -fl_reads isoseq_flnc.fastq -candidates /isoconOutput/temp_candidates_step_5.fa -outfolder /isoconOutput3 --ignore_ends 0

XXXXXXXXXXXXXXXXXX
nr_cores: 16
p_value_threshold: 0.01
ignore_ends_len: 0
min_test_ratio: 5
ccs: None
min_candidate_support: 2
minimap_alignments: /isoconOutput3/minimapped
tempfolder: /isoconOutput3/alignments
neighbor_search_depth: 4294967296
cleanup: False
candidates: /isoconOutput2/temp_candidates_step_5.fa
filtered_reads: <open file '/isoconOutput3/filtered_reads.fa', mode 'w' at 0x2aaab2c19150>
is_fastq: True
which: stat_filter
fl_reads: /isoseq_flnc.fastq
outfolder: /isoconOutput3
min_exon_diff: 20
logfile: <open file '/isoconOutput3/logfile.txt', mode 'w' at 0x2aaab2c191e0>
max_phred_q_trusted: 43
verbose: False
2018-04-14 20:18:15.421908 Starting.
XXXXXXXXXXXXXXXXXXXXXX

And the process has finished without error.

processing 30500
Edges in candidate NN graph: 10317
('Unique before compression: ', 30533)
('Unique after compression: ', 29010)
Edges in candidate homopolymenr invariant graph: 3912
Total union of edges: 11307
Total edges after removing dominant candidates: 3
NUMBER OF CANDIDATES LEFT: 30533 .
Number statistical tests in this round: 3
Normal termination
Total number of tests performed this round: 3
Median corrected p-val: 4.72899566405e-15
Number of unique candidates tested: 9485
Filtering threshold (p_val*mult_correction_factor): 0.01
nr candidates left: 30533
Candidates written to file: 30533
Candidates written to file: 30533

The file final_candidates.fa was generated with 30533 isoforms (I have 30271 HQ polished isoforms from ICE and Quiver; I haven't checked the difference yet).

The files generated are:

0 Apr 14 20:18 filtered_reads.fa
4096 Apr 14 20:18 alignments // empty
47677256 Apr 14 20:18 preprocessed_candidates.fa
47677256 Apr 14 20:24 temp_candidates_step_1.fa
47675363 Apr 15 13:34 candidates_after_step_1.fa
47675363 Apr 15 13:34 temp_candidates_step_2.fa
47673985 Apr 15 21:31 candidates_after_step_2.fa
47673985 Apr 15 21:31 temp_candidates_step_3.fa
47673985 Apr 16 05:24 candidates_after_step_3.fa
47673985 Apr 16 05:24 temp_candidates_step_4.fa
47175064 Apr 16 22:36 candidates_after_step_4.fa
47175064 Apr 16 22:36 temp_candidates_step_5.fa
47151463 Apr 17 06:30 candidates_after_step_5.fa
47151463 Apr 17 06:30 temp_candidates_step_6.fa
47149837 Apr 17 14:19 candidates_after_step_6.fa
47149837 Apr 17 14:19 temp_candidates_step_7.fa
15971 Apr 17 14:19 remaining_to_align.fa
47149837 Apr 17 22:11 candidates_after_step_7.fa
20819932 Apr 17 22:11 cluster_info.tsv
47897243 Apr 17 22:11 final_candidates.fa
1732 Apr 17 22:11 logfile.txt

The process seems to have generated the expected files, hasn't it? I haven't tried preCluster with ToFU yet; I don't know whether the results would differ, or just the runtime.

Thank you again, I appreciate any advice. Virgg

ksahlin commented 6 years ago

Yes, IsoCon has finished without error.

My guess is that preCluster will greatly improve the runtime but give fairly similar predictions. However, running IsoCon on a nontargeted dataset with --ignore_ends 0 as above likely gives many redundant transcript predictions. The improved runtime makes it feasible to set --ignore_ends to, e.g., 50 (a value of 40-100 is preferred on nontargeted data). With this value, the results will likely be fairly different (a less redundant set of predicted transcripts).

The redundancy arises because, with --ignore_ends 0, IsoCon won't statistically validate (and filter) transcripts that differ in where their ends have been cut. The cut positions are much more variable for a nontargeted dataset than for a targeted one. Some of this redundancy can be removed by post-processing the predictions and removing transcripts that are perfect substrings of another transcript, but this will only remove part of it.
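The substring post-processing could look like the following Python sketch. It is a simple quadratic pass, not an IsoCon feature, and the function name is made up. It only looks at the forward strand and drops exact duplicates as well, keeping the first copy. On tens of thousands of transcripts it will be slow, so an index-based approach may be preferable there.

```python
def remove_contained(transcripts):
    """Drop transcripts that are perfect substrings of a longer (or equal) one.

    `transcripts` maps name -> sequence; returns a new dict with the survivors.
    """
    kept = {}
    # Longest first, so any containing transcript is examined before what it contains.
    by_length = sorted(transcripts.items(), key=lambda item: len(item[1]), reverse=True)
    for name, seq in by_length:
        if not any(seq in longer for longer in kept.values()):
            kept[name] = seq
    return kept


# Made-up example: t2 is contained in t1, t4 duplicates t1, t3 is unrelated.
example = {
    "t1": "ACGTACGTAC",
    "t2": "GTACGT",
    "t3": "TTTT",
    "t4": "ACGTACGTAC",
}
print(remove_contained(example))  # {'t1': 'ACGTACGTAC', 't3': 'TTTT'}
```

Checking against the kept transcripts only is sufficient, because anything contained in a dropped transcript is also contained in the transcript that caused it to be dropped.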

uqvirg commented 6 years ago

Thank you !