eprdz opened this issue 11 months ago
Thank you for reporting.
The lima (PacBio) or pychopper (ONT) preprocessing tools can be used to remove long degenerate reads. I recommend using one of them for any preprocessing of "raw" reads.
The peak you saw could have happened in the sorting step (prior to clustering).
Yes, I could add a warning message and flag reads above a certain length threshold (but I think taking care of these reads with a preprocessing tool is the way to go).
Best, Kristoffer
Additionally, what parameters do you run isONclust
with? Parameters such as --k and --w can affect runtime and memory usage significantly.
Hi, first of all, thank you for your feedback, and sorry that I could not reply sooner.
I ran pychopper and those extremely long reads were still in the dataset, so I removed them manually and everything went well.
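For anyone hitting the same problem, removing over-long reads can also be scripted rather than done manually; a minimal Python sketch (the 100 kb threshold and the stdin/stdout usage are my assumptions, not part of the isON pipeline):

```python
import sys

# Hypothetical helper: keep only FASTQ records whose sequence is at most
# max_len bases. The threshold and I/O handling are assumptions for illustration.
MAX_LEN = 100_000

def filter_fastq(lines, max_len=MAX_LEN):
    """Yield the 4 lines of each FASTQ record whose read is <= max_len bases."""
    for i in range(0, len(lines), 4):
        record = lines[i:i + 4]
        if len(record) == 4 and len(record[1].rstrip("\n")) <= max_len:
            yield from record

if __name__ == "__main__":
    # usage: python filter_fastq.py < full_length.fq > filtered.fq
    sys.stdout.writelines(filter_fastq(sys.stdin.readlines()))
```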
Moreover, to run isONclust I used the _fullpipeline.sh script with the full option from the isONform repository, so I think the following commands were executed:
```shell
/usr/bin/time -v isONclust --t $num_cores --ont --fastq $outfolder/full_length.fq \
    --outfolder $outfolder/clustering
/usr/bin/time -v isONclust write_fastq --N $iso_abundance --clusters $outfolder/clustering/final_clusters.tsv \
    --fastq $outfolder/full_length.fq --outfolder $outfolder/clustering/fastq_files
```
Thanks again for your help.
> `/usr/bin/time -v isONclust --t $num_cores --ont`
I see, then my first answer stands. My second answer was in reference to this comment in the isONform repo: https://github.com/aljpetri/isONform/issues/16#issuecomment-1872621172:
Hi, sorry to reopen this issue. I have a question regarding it.
As I said last time, I implemented an in-house filtering step before isONclust to remove reads longer than 5 kb, since I have seen that datasets with reads longer than that are very time- and memory-consuming. Nevertheless, some of these reads are not artifacts, and I want to use them for isONcorrect and isONform. Do you know if there is a way to "rescue" those reads in isONcorrect and isONform?
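One way to "rescue" them might be to partition the input instead of discarding the long reads, and run the long set through correction separately; a rough Python sketch (the 5 kb cutoff is from the discussion above, everything else is assumed):

```python
# Hypothetical sketch: split FASTQ records into a "short" (<= 5 kb) and a
# "long" (> 5 kb) set, so the long reads can go through a separate
# isONcorrect/isONform run instead of being thrown away entirely.
THRESHOLD = 5_000  # cutoff mentioned in the discussion; the rest is assumed

def partition_fastq(lines, threshold=THRESHOLD):
    """Return (short_lines, long_lines) as two flat lists of FASTQ lines."""
    short, long_reads = [], []
    for i in range(0, len(lines), 4):
        record = lines[i:i + 4]
        if len(record) < 4:
            continue  # skip a trailing partial record
        dest = short if len(record[1].rstrip("\n")) <= threshold else long_reads
        dest.extend(record)
    return short, long_reads
```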
Thank you again!
Hi again!
Two new answers:
Understood! Thank you!
Hi, I have now made the code repository for isONclust3 public. The code can be found at: https://github.com/aljpetri/isONclust3. Please let us know how testing the tool works out for you.
I was running isONclust in parallel as a preliminary step to define a transcriptome from ONT data with isONform. I looked at the memory profile of isONclust and, after a few minutes, when it was almost reaching the memory limit (125 GB), its memory consumption dropped to 40-50 GB. isONclust seemed to be working, since the command prompt did not return and no error was thrown, but only 1 thread out of all those launched was actually alive. I realized that there were 2 reads that were very long (over 100 kb), while the other reads were 10 kb long at most. I removed those outliers and now it seems to work.
I was thinking that maybe an error or warning could be thrown in such cases, to avoid confusion.
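Such a warning could be as simple as flagging reads far above the dataset's median length before clustering starts; a sketch of the idea (the 10x factor is an arbitrary assumption, not an isONclust parameter), which would have caught two ~100 kb reads in a mostly <=10 kb dataset:

```python
from statistics import median

# Hypothetical sketch of the suggested warning: flag reads whose length is
# more than `factor` times the median read length. The 10x factor is an
# arbitrary assumption chosen for illustration.
def length_outliers(lengths, factor=10):
    """Return (index, length) pairs of reads longer than factor * median."""
    cutoff = factor * median(lengths)
    return [(i, n) for i, n in enumerate(lengths) if n > cutoff]
```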
Thanks for your time!