ksahlin / isONclust

De novo clustering of long transcript reads into genes
GNU General Public License v3.0

Error on write_fastq #25

Closed NHoang98 closed 3 weeks ago

NHoang98 commented 2 months ago

Hello, and first of all, thank you for the package! We are currently trying the tool on our transcriptomic data set. It works perfectly fine in clustering mode, but we encountered an error when extracting the clustered reads as fastq (write_fastq).

For detail, the data was clustered and then extracted using these two commands:

isONclust --t 20 --ont --fastq Documents/Gac/RNAseq/3.filtering/Aril/A1.fq --outfolder Documents/Gac/RNAseq/4.mapping/isONclust/Aril/cluster/A1

isONclust write_fastq --N 1 --clusters Documents/Gac/RNAseq/4.mapping/isONclust/Aril/cluster/A1/final_clusters.tsv --fastq Documents/Gac/RNAseq/3.filtering/Aril/A1.fq --outfolder Documents/Gac/RNAseq/4.mapping/isONclust/Aril/cluster/A1/fastq_files

And then the error returned shortly after the 2nd command was run:

Traceback (most recent call last):
  File "/home/cmmr/anaconda3/bin/isONclust", line 217, in <module>
    write_fastq(args)
  File "/home/cmmr/anaconda3/bin/isONclust", line 164, in write_fastq
    seq, qual = reads[acc]
KeyError: '28dcdb4a-59e7-4953-8a93-74bed5b2449f_st:Z:2024-07-18T23:14:47.352+00:00'

At this point, we have no clue how to fix the problem. Could you please take a moment to check this error out?

Thanks in advance!

ksahlin commented 2 months ago

The KeyError says that this read accession is not present in the fastq file, so somehow a read that was clustered cannot be found in the fastq file given to write_fastq.
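A quick way to confirm this diagnosis is to list the accessions that appear in the cluster file but not in the fastq file. The sketch below is not part of isONclust; the function names are hypothetical, and it assumes that final_clusters.tsv has two tab-separated columns (cluster ID, read accession) and that the fastq file uses strict 4-line records.

```python
def fastq_accessions(fastq_path, first_token_only=False):
    """Collect read names from the header line of each 4-line FASTQ record."""
    names = set()
    with open(fastq_path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0:  # header line of each record (assumes 4-line FASTQ)
                name = line.rstrip("\n")[1:]  # drop the leading '@'
                if first_token_only:
                    parts = name.split()
                    name = parts[0] if parts else ""
                names.add(name)
    return names


def cluster_accessions(clusters_path):
    """Collect accessions from a cluster file.

    Assumption: each line is 'cluster_id<TAB>accession'.
    """
    accs = set()
    with open(clusters_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                accs.add(fields[1])
    return accs


def missing_accessions(clusters_path, fastq_path, first_token_only=False):
    """Return accessions present in the cluster file but absent from the FASTQ."""
    return cluster_accessions(clusters_path) - fastq_accessions(
        fastq_path, first_token_only
    )
```

Running missing_accessions on the two files passed to write_fastq, once with first_token_only=False and once with True, should show whether the mismatch comes from how the header line is cut into a read name.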

NHoang98 commented 2 months ago

Thanks for the fast reply. Well, the directory 3.filtering actually holds the filtered reads after the pychopper step (we discard reads <100 bp with the Filtlong package). The .fq output from Filtlong is the input for isONclust. Do you think the way Filtlong produces its output might lead to this?

ksahlin commented 2 months ago

hmm, I don't think so. If the read file given as input to the clustering is the same as the read file given as input to the write_fastq, and the final_clusters.tsv is produced using the same read file as for the clustering, then I don't see how this error can happen.

NHoang98 commented 1 month ago

Hi @ksahlin,

Sorry for bringing this back again after a long time. At the end of the day, we still find that isONclust is the best choice for our experiment. This time we tried the sorted.fastq in the isONclust output folder instead of the input .fq, and the error is still there. So I think there might be a problem with my .fq headers? I've run the sample_alz_2k.fastq file multiple times and it seems normal. But since your test file looks like it comes from a PacBio CCS run, I don't know whether my fastq headers are in the right format.

This is an example of one of the headers, copied from the sorted.fastq file:

@b1c4fdcc-1f64-48d9-9e27-b6d9ed152198_st:Z:2024-07-19T12:03:20.721+00:00 RG:Z:ee6fa1023bdc36c23924a004f40565b31c16f1c6_dna_r9.4.1_e8_sup@v3.6_SQK-PCB111-24_barcode01_39652.62084870732

fyi: when I compare the final_cluster_origins.tsv from my run with the one from the test run, it looks like my header was accidentally split into 2 columns. Two screenshots were attached here: test_run and A1_run.
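The same comparison can be made without screenshots by counting the fields per line in each file. The helper below is a hypothetical sketch, not part of isONclust, and it makes no assumption about what the columns of final_cluster_origins.tsv mean. Comparing a tab split with a whitespace split shows whether spaces embedded in the read header (such as the one before RG:Z: in the header above) add an extra column.

```python
from collections import Counter


def column_counts(tsv_path, sep="\t"):
    """Count how many non-empty lines have each number of fields.

    sep="\t" splits on tabs only; sep=None splits on any run of whitespace,
    so a space inside a read header produces an extra field.
    """
    counts = Counter()
    with open(tsv_path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                counts[len(line.split(sep))] += 1
    return counts
```

If column_counts(path) and column_counts(path, sep=None) disagree for the A1 run but agree for the test run, the extra column comes from whitespace inside the read headers.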

Hope to hear great news from you soon!

ksahlin commented 1 month ago

How about modifying the accessions of all reads before clustering, e.g. by removing everything after the first underscore with: line[1:].split('_')[0] to keep only the b1c4fdcc-1f64-48d9-9e27-b6d9ed152198 part. Maybe this helps.
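The renaming suggested above can be sketched as a small standalone script. The function names are hypothetical, and the sketch assumes strict 4-line fastq records and read IDs that contain no underscore themselves (true for the UUID-style IDs in this thread). If two reads share the same prefix, their shortened names would collide, so it is worth checking for duplicates afterwards.

```python
def shorten_accession(header_line):
    """Keep only the part of a FASTQ header before the first underscore."""
    name = header_line.rstrip("\n")[1:].split("_")[0]  # drop '@', cut at '_'
    return "@" + name


def rename_fastq(in_path, out_path):
    """Write a copy of a 4-line-per-record FASTQ file with shortened headers."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for i, line in enumerate(src):
            if i % 4 == 0:  # header line of each record
                dst.write(shorten_accession(line) + "\n")
            else:  # sequence, '+' separator, and quality lines are unchanged
                dst.write(line)
```

The renamed file would then be used as the input for both the clustering run and write_fastq, so that the accessions in final_clusters.tsv and in the fastq file agree.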

Also, we're about to release isONclust3 any time now (likely within a week or two). Let me know if you're interested in trying this tool out and we can arrange access before the release.

NHoang98 commented 1 month ago

I had the same idea: we renamed and subsampled the reads (to around 10k reads) with seqkit. Things ran as smooth as butter!

About the new version, glad to hear about the release! Our data needs about 2-3 weeks to be analyzed with the original version, and I think that timing fits well with the release date of the new version. At that point, we would be happy to try it on our data!

ksahlin commented 1 month ago

Okay, we'll let you know when it is released.

2-3 weeks for getting results sounds quite bad. I dare to bet that you'll be able to see more than 10x speedup with isONclust3.

CC @aljpetri

aljpetri commented 3 weeks ago

Hi, I have now set the code repository for isONclust3 to public. The code can be found at: https://github.com/aljpetri/isONclust3 Please let us know how testing the tool works out for you.

NHoang98 commented 3 weeks ago

Glad to hear about the release! Could you please update the usage instructions in the new version's repository? We will try it on our data and give feedback as soon as we can! Also, I'll close this issue since it has been solved!