ksahlin / isONclust

De novo clustering of long transcript reads into genes
GNU General Public License v3.0

Usage of --consensus to generate a non-redundant fasta file with unique transcript sequences #11

Closed ALSP6OCT closed 3 years ago

ALSP6OCT commented 4 years ago

Hello, I would like to generate a single fasta file with all unique transcripts using the --consensus parameter from reads generated by ONT cDNA sequencing. I have tried to do so on two different fastq files with varying numbers of reads. On the smaller file it exits, generating a fasta file without any sequences. In another instance it gave the following error:

    Total number of reads iterated through: 255873
    Passed mapping criteria: 12291
    Passed alignment criteria in this process: 5110
    Total calls to alignment mudule in this process: 5435
    Time elapesd clustering last iteration single core: 267.3920338153839
    Time elapsed clustering: 969.0639133453369
    Nr clusters larger than 1: 22938
    Nr clusters (all): 238472

    STARTING TO CREATE CLUSTER CONSENSUS

    Temporary workdirektory for polishing: /tmp/tmptcwada2z
    creating center of 62799 sequences.
    Traceback (most recent call last):
      File "/home/superuser/ENTER/bin/isONclust", line 263, in <module>
        main(args)
      File "/home/superuser/ENTER/bin/isONclust", line 127, in main
        center = consensus.run_spoa(reads_path.name, os.path.join(work_dir, "spoa_tmp.fa"), "spoa")
      File "/home/superuser/ENTER/lib/python3.8/site-packages/modules/consensus.py", line 89, in run_spoa
        subprocess.check_call([spoa_path, reads, "-l", "0", "-r", "0", "-g", "-2"], stdout=output_file, stderr=null)
      File "/home/superuser/ENTER/lib/python3.8/subprocess.py", line 359, in check_call
        retcode = call(*popenargs, **kwargs)
      File "/home/superuser/ENTER/lib/python3.8/subprocess.py", line 340, in call
        with Popen(*popenargs, **kwargs) as p:
      File "/home/superuser/ENTER/lib/python3.8/subprocess.py", line 854, in __init__
        self._execute_child(args, executable, preexec_fn, close_fds,
      File "/home/superuser/ENTER/lib/python3.8/subprocess.py", line 1702, in _execute_child
        raise child_exception_type(errno_num, err_msg, err_filename)
    FileNotFoundError: [Errno 2] No such file or directory: 'spoa'

Could you please let me know if spoa is installed as part of the isONclust package? I ask because of the lack of any error message associated with the smaller fastq file, where it generated a fasta file without writing any sequences. Thanks!

ksahlin commented 4 years ago

Hi @ALSP6OCT!

The --consensus parameter is experimental and was implemented after the original publication of isONclust, so the answer is no: spoa needs to be installed separately and put in your PATH. But this is very easy, see here.
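Since consensus.run_spoa invokes the spoa binary through subprocess (as your traceback shows), a quick way to preflight your environment is something along these lines (untested sketch, not part of isONclust):

```python
# Minimal preflight sketch: check that the `spoa` binary is on PATH before
# running isONclust with --consensus. consensus.run_spoa calls it via
# subprocess and raises FileNotFoundError otherwise (as in the traceback).
import shutil

spoa_path = shutil.which("spoa")
if spoa_path is None:
    raise SystemExit("spoa not found on PATH: install it and add its bin/ directory to PATH")
print("spoa found at:", spoa_path)
```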

However, I see from the output that the number of singleton clusters (containing one read) is very high: 238,472 total clusters minus 22,938 clusters with more than one read leaves roughly 215,000 singletons, out of only 255,873 reads in total. This means that very few large clusters are formed. So you will likely be disappointed with the outcome. Are you expecting very few large clusters?

Best, Kristoffer

ksahlin commented 4 years ago

Also, I would use the parameter --ont for ONT cDNA data in case you haven't, because the default --k and --w are tuned for IsoSeq data and don't work as well for ONT.
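For reference, a sketch of a typical invocation wrapped in Python's subprocess (file names are placeholders; I'm assuming the standard --fastq/--outfolder flags here):

```python
# Sketch of an isONclust run on ONT cDNA data (paths are placeholders).
# --consensus additionally requires spoa on PATH, see above.
import subprocess

subprocess.check_call([
    "isONclust",
    "--ont",                               # ONT-tuned --k/--w instead of the IsoSeq defaults
    "--consensus",                         # experimental: consensus per cluster
    "--fastq", "full_length_reads.fastq",  # e.g., Pychopper output
    "--outfolder", "isonclust_out",
])
```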

ALSP6OCT commented 4 years ago

Thank you for promptly getting back to me. I am using the --ont option and the --consensus option on the output fastq file from Pychopper. I have also tried to install spoa, but I am unable to do so due to TCP errors (something to the effect of "Connection closed by peer") when using git.

I do have larger ONT fastq files (samples that generated higher-quality reads). However, these files won't even complete the clustering step in isONclust. Would you recommend isONclust2 for doing so? If yes, do you know of an option to generate a consensus from each cluster's fastq file? Thank you!

ksahlin commented 4 years ago

Oh okay, did you try using more cores with --t? How many reads do you have? One issue that I've seen before is that a large fraction of the reads are very short (e.g., under 100 nucleotides); this slows clustering down significantly and can be fixed by removing reads below a certain length (e.g., 100 nt) before running isONclust (a sketch is below). Alternatively, it may also be solved by lowering the parameter --min_shared to, e.g., 3 (instead of the default 5).
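A minimal length-filter sketch (my own, not part of isONclust), assuming plain 4-line FASTQ records:

```python
# Drop FASTQ records shorter than MIN_LEN before clustering.
# Assumes uncompressed, 4-line-per-record FASTQ.
import sys

MIN_LEN = 100  # nucleotides

def filter_fastq(in_path, out_path, min_len=MIN_LEN):
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            header = fin.readline()
            if not header:
                break  # end of file
            seq = fin.readline()
            plus = fin.readline()
            qual = fin.readline()
            if len(seq.rstrip("\n")) >= min_len:
                fout.write(header + seq + plus + qual)

if __name__ == "__main__":
    filter_fastq(sys.argv[1], sys.argv[2])
```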

Yes, isONclust2 is faster and can handle larger data. I'm unsure if there is a consensus option, but I remember Botond (developer of https://github.com/nanoporetech/pipeline-nanopore-denovo-isoforms) mentioning something about such an option. Hopefully he'll get back to you soon (I saw you posted an issue in that repo).

ALSP6OCT commented 4 years ago

Hello, thank you for your input. I am using --t 24. The larger file I have has the following stats:

    Mean read length:      573.7
    Mean read quality:       8.3
    Median read length:    476.0
    Median read quality:     8.7
    Number of reads:     8,219,100.0
    Read length N50:       759.0
    Total bases:         4,715,369,850.0

Number, percentage and megabases of reads above quality cutoffs:

    Q5:  7254613 (88.3%) 4278.5Mb
    Q7:  5942847 (72.3%) 3645.6Mb
    Q10: 2170584 (26.4%) 1685.0Mb
    Q12:  248376  (3.0%)  240.0Mb
    Q15:     107  (0.0%)    0.1Mb

I will definitely try removing reads shorter than 100 bp. If it still doesn't work, I will reduce --min_shared to 3. Thanks again!

ksahlin commented 4 years ago

Sure, no problem! It's still not certain that it will complete. Another potential issue is if the dataset has a lot of intra-priming of genomic regions (the hallmark is a poly-A region on the genome right next to the read alignment); such reads slow the clustering down significantly because they are not from gene regions and many of them are unique. The level of intra-priming depends on the ONT protocol and which organism you are working on.

There is currently no method to remove such intra-primed genomic reads without using a reference genome; with a reference, these reads can be filtered out based on their alignments to the genome (a sketch is below). Depending on which organism you are working on, you may not have access to a reference genome.
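A rough sketch of such a reference-based screen (my own, untested; assumes a coordinate-sorted, indexed BAM of the reads aligned to the genome plus the matching indexed FASTA; the 20 nt window and 80% A-fraction thresholds are illustrative assumptions):

```python
# Flag reads whose genomic alignment is immediately followed by an A-rich
# stretch on the genome -- the intra-priming hallmark described above.
import pysam

WINDOW = 20        # bases to inspect downstream of the alignment end
A_FRACTION = 0.8   # call it intra-priming if >= 80% of the window is A/T

def flag_intrapriming(bam_path, fasta_path):
    genome = pysam.FastaFile(fasta_path)
    suspects = set()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for aln in bam:
            if aln.is_unmapped or aln.is_secondary or aln.is_supplementary:
                continue
            chrom_len = genome.get_reference_length(aln.reference_name)
            if not aln.is_reverse:
                # Forward strand: look for poly-A right after the alignment end.
                start = aln.reference_end
                window = genome.fetch(aln.reference_name, start,
                                      min(start + WINDOW, chrom_len)).upper()
                base = "A"
            else:
                # Reverse strand: look for poly-T right before the alignment start.
                start = max(aln.reference_start - WINDOW, 0)
                window = genome.fetch(aln.reference_name, start,
                                      aln.reference_start).upper()
                base = "T"
            if window and window.count(base) / len(window) >= A_FRACTION:
                suspects.add(aln.query_name)
    return suspects
```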

ALSP6OCT commented 4 years ago

Yes, you are right, the species (a pest found in agricultural storage) lacks a reference genome. I do have access to a reference transcriptome generated using short-read sequencing. However, the whole idea was to sequence with long-read technology to obtain full-length, clean transcripts, especially for certain genes that were missing 5' sequence in the previous transcriptome. The samples for the current round of sequencing were obtained from storage locations, whereas the short-read sequencing was conducted on individual samples reared in the lab. Thank you for providing additional insights on intra-priming of genomic regions. We deal with other species that do have reference genomes, and there an additional cleanup using the reference can be done.

ksahlin commented 4 years ago

In case intra-priming is a substantial issue and you don't have a reference:

Another approach is to map the short reads to the long ones. The intra-primed long reads would get close to 0 short reads mapped to them under the following assumptions (a sketch of this screen follows the list):

  1. The short-read data has higher coverage and contains all transcripts (or at least gene regions) found in the long-read dataset.
  2. The short-read dataset does not have genomic contamination from the same regions.

1 should probably hold, but I'm unsure about 2 as I'm not an expert on short-read RNA-seq protocols.
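A sketch of that screen (my own, untested; assumes you have already mapped the short reads against the long reads used as the reference, and sorted/indexed the resulting BAM):

```python
# List long reads that attract (close to) zero short-read alignments --
# candidate intra-primed reads under assumptions 1 and 2 above.
import pysam

MAX_MAPPED = 0  # treat "close to 0" as exactly 0 here; tune as needed

def candidate_intraprimed(bam_path, max_mapped=MAX_MAPPED):
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        # get_index_statistics() reports mapped read counts per reference
        # sequence, i.e., per long read in this setup.
        return [stat.contig for stat in bam.get_index_statistics()
                if stat.mapped <= max_mapped]

if __name__ == "__main__":
    for name in candidate_intraprimed("short_on_long.sorted.bam"):
        print(name)
```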