nanoporetech / pinfish

Tools to annotate genomes using long read transcriptomics data

Using medaka with transcriptome data #15

Closed MartinTes closed 4 years ago

MartinTes commented 4 years ago

Hi,

I am new to Pomoxis and I am trying to run a polishing step on budgerigar transcriptomic data after a genome reference-based assembly, but I am getting an error at the consensus-generation step.

First, I successfully created a reference-based assembly (against the budgerigar genome, with only the longest transcript per gene selected) with the following script:

#!/bin/bash

NPROC=$(nproc)
BASECALLS=data/Pool4_merged_trimmed_BC.fastq
REFERENCE=/home/martin.tesicky/Medaka_parrots/Pool4/medaka_walkthrough/Budgerigar_genome_longest_transcript_mart_export.fasta

mini_assemble -i ${BASECALLS} -r ${REFERENCE} -o draft_assm_with_reference -p assm -t ${NPROC}

Then, I ran the consensus polishing step as follows:

#!/bin/bash

NPROC=$(nproc)
BASECALLS=data/Pool4_merged_trimmed_BC.fastq
DRAFT=draft_assm_with_reference/assm_final.fa
OUTDIR=medaka_consensus_with_reference_1_12

medaka_consensus -i ${BASECALLS} -d ${DRAFT} -o ${OUTDIR} -t ${NPROC} -m r941_min_h

When I run the medaka consensus step, the FASTA file with the consensus sequence is always empty, the log file reports "Failed to stitch consensus chunks.", and the terminal shows the traceback below (please also see the complete output in the attached files). I also tried a de novo assembly without a reference, but I got a very low number of contigs (only ca 1,500), although the polishing step did work with that approach. With the reference-based assembly I get many more contigs (ca 12,000), but the consensus step does not work. I would also welcome any advice on which parameters I should adjust for the de novo assembly, since we also have other species for which no genomes are available @cjw85.

I would be very grateful for any help.

Attached files: typescript.txt, medaka_consenus_runner_with_reference__01_12_logfile.txt

Traceback (most recent call last):
  File "/home/martin.tesicky/miniconda3/envs/medaka/bin/medaka", line 11, in <module>
    sys.exit(main())
  File "/home/martin.tesicky/miniconda3/envs/medaka/lib/python3.6/site-packages/medaka/medaka.py", line 431, in main
    args.func(args)
  File "/home/martin.tesicky/miniconda3/envs/medaka/lib/python3.6/site-packages/medaka/stitch.py", line 137, in stitch
    for contigs in executor.map(worker, regions):
  File "/home/martin.tesicky/miniconda3/envs/medaka/lib/python3.6/concurrent/futures/process.py", line 496, in map
    timeout=timeout)
  File "/home/martin.tesicky/miniconda3/envs/medaka/lib/python3.6/concurrent/futures/_base.py", line 575, in map
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "/home/martin.tesicky/miniconda3/envs/medaka/lib/python3.6/concurrent/futures/_base.py", line 575, in <listcomp>
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "/home/martin.tesicky/miniconda3/envs/medaka/lib/python3.6/concurrent/futures/process.py", line 139, in _get_chunks
    chunk = tuple(itertools.islice(it, chunksize))
  File "/home/martin.tesicky/miniconda3/envs/medaka/lib/python3.6/site-packages/medaka/common.py", line 610, in grouper
    batch.append(next(gen))
  File "/home/martin.tesicky/miniconda3/envs/medaka/lib/python3.6/site-packages/medaka/stitch.py", line 130, in <genexpr>
    (common.Region.from_string(r) for r in args.regions),
  File "/home/martin.tesicky/miniconda3/envs/medaka/lib/python3.6/site-packages/medaka/common.py", line 469, in from_string
    start = int(bounds)
ValueError: invalid literal for int() with base 10: 'ubiquinone'

(medaka) martin.tesicky@turbacz:~/Medaka_parrots/Pool4/medaka_walkthrough$ ls
assess_assembly_runner.bash  assess_assembly_stats_runner.bash

cjw85 commented 4 years ago

Hi @MartinTes

I think your issue is related to a previous one (https://github.com/nanoporetech/medaka/issues/93) and stems from how medaka interprets sequence names in fasta/fastq/bam files. It looks like we need to review our walkthrough; it's a little stale.
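A minimal sketch of a possible workaround, assuming the failure comes from colons or spaces in the draft contig names (medaka parses region strings of the form name:start-end, and the ValueError above suggests a header containing something like "...:ubiquinone ..."); the renamed draft would then be passed to medaka_consensus:

# replace colons and spaces in FASTA headers with underscores (assumed to be the
# characters that break medaka's region parsing)
sed '/^>/ s/[: ]/_/g' draft_assm_with_reference/assm_final.fa > draft_assm_with_reference/assm_final_renamed.fa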

For de novo assembly of your genomes I would recommend using canu, followed by racon and then medaka.
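A rough sketch of that suggested route; the genome size, file names and the canu read flag are assumptions for illustration only:

# de novo assembly with canu (genomeSize is a placeholder guess)
canu -p asm -d canu_out genomeSize=1.2g -nanopore-raw reads.fastq

# one round of racon polishing: map the reads back to the contigs, then polish
minimap2 -x map-ont canu_out/asm.contigs.fasta reads.fastq > overlaps.paf
racon reads.fastq overlaps.paf canu_out/asm.contigs.fasta > racon_polished.fasta

# final polish with medaka
medaka_consensus -i reads.fastq -d racon_polished.fasta -o medaka_out -t $(nproc)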

Apologies for the short response; we're a little busy this week. I hope it is helpful nonetheless.

MartinTes commented 4 years ago

Hi @cjw85, thank you very much for your help. You were right: when I took the script from #93 and applied it to my data, it started to work. The reference-based assembly now runs through the polishing step, but we can use this approach only for one species – the budgerigar, for which a genome is available; for the other species without genomes we have to try de novo transcriptome assembly. However, I have just encountered another issue.

For the species without reference genomes, we decided to first map our sequences against the budgerigar genome (with only the longest transcript per gene selected) and, based on that, to keep for the draft assembly only the reads that mapped successfully: Minimap2 (run independently, outside the Pomoxis pipeline) -> SAM file -> sorting and indexing with Samtools and conversion to BAM (script_from_mapping_to_stats.bash). Then I converted all mapped reads from the BAM file to FASTQ using Bedtools and used these preselected raw sequences as the input for the de novo transcriptome assembly (BAM_extraction_FASTAQ_convesion_script.sh).
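For reference, a condensed sketch of those preselection steps (the attached scripts are authoritative; the file names and the minimap2 preset below are assumptions):

# map reads to the longest-transcript reference, then sort and index
minimap2 -ax splice Budgerigar_genome_longest_transcript_mart_export.fasta Pool4_merged_trimmed_BC.fastq > aln.sam
samtools sort -o aln.sorted.bam aln.sam
samtools index aln.sorted.bam

# keep only mapped, primary alignments, then convert back to FASTQ
samtools view -b -F 0x904 aln.sorted.bam > mapped_primary.bam
bedtools bamtofastq -i mapped_primary.bam -fq reads_selected_transcripts.txt.fq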

The reasoning behind this is that we believe that by removing many unwanted reads (either unmapped reads or reads mapped to short alternative splicing variants) we could improve the clustering. Is that a reasonable assumption? As I mentioned last time, with de novo transcriptome clustering without a reference using Pomoxis and Medaka we were struggling with a very low number of contigs (ca 1,500).

However, when I tried to run the draft assembly this way, first with all genes and then only with reads selected for one trial gene at different coverages (1 gene with 10, 30, 100 and 1,000 reads; reads_selected_transcripts.txt.fq), the program always terminated for unknown reasons, leaving an empty .fasta file (for some files after the first racon step, for others after the third; please see the attached screen output reads_selected_typescript.txt and the log file reads_selected_transcripts.txt.fq_logfile.txt). There might be something wrong with the file format after the SAM/BAM/FASTQ manipulation. For the final conversion we also tried different FASTQ-conversion scripts, and we also converted to FASTA, but the issue persisted in all cases. In fact, when I use the raw FASTQ reads straight after sequencing, with only the adaptors trimmed (Raw_fastq_reads.fastq – an example cut down to a few sequences), the program works; but when I use the reads extracted from the BAM, it does not (reads_selected_transcripts.txt.fq). Please see all the attached files, including the scripts we used, in the ZIP folder: Medaka_issue_2019_12_12_MT.zip
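Two hedged sanity checks that may help pin down whether the extracted FASTQ itself is the problem; the duplicate-name filter is only relevant if secondary/supplementary alignments were not excluded before the BAM-to-FASTQ conversion (one possible, unconfirmed cause of malformed records):

# basic statistics on the extracted reads (record count, length distribution)
seqkit stats reads_selected_transcripts.txt.fq

# drop duplicate records that can appear when one read has several alignments
seqkit rmdup -n reads_selected_transcripts.txt.fq -o reads_selected_dedup.fq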

For our data you suggested Canu, but for the de novo assembly we have no genomic data, only transcriptomic data, and many assemblers, including Canu, are simply not intended for this type of data. We actually tried Canu and, after hitting many bugs, contacted its developers, who advised us to use Flye; but running Flye we are also hitting many bugs…

As far as we know, tools for de novo transcriptome clustering of ONT data still need to be developed. If there are any parameters in Pomoxis and Medaka that in your opinion could be adjusted to increase the number of contigs (e.g. the maximum and minimum number of reads, etc.), it would be worth trying them. I would welcome any suggestions. My last question concerns the mini_assemble script implemented in Pomoxis. Do I understand correctly that for the reference-based assembly there are also some polishing steps done by Racon that precede the medaka consensus/polishing? If so, could a reference-based assembly against a different species' genome negatively influence the results, e.g. by introducing additional errors? I would be very grateful for any suggestions and advice.

cjw85 commented 4 years ago

I'm going to defer to some others here with more experience dealing with transcriptome data. We have some tools for transcript clustering and in principle one could use medaka to derive high quality sequences once the sequences for which it is meaningful to take consensuses have been identified.

@bsipos

ksahlin commented 4 years ago

Gonna chime in here. It depends on your ultimate goal, but both CARNAC-LR and isONclust are reference-free tools that take nanopore transcript reads and cluster them into genes/gene-families. @bsipos and I are developing isONclust.

MartinTes commented 4 years ago

Dear @ksahlin and @bsipos,

we sequenced the transcriptomes of 6 parrot species from the intestine on the GridION platform, aiming to identify positively selected immune genes. For that, we would like to have high-quality, noise-free ONT data for whole-transcriptome scans of positive selection. Given the nice documentation and the combination of several tools in one pipeline, we started to play with Pomoxis and Medaka, but we encountered some technical issues (see above). One issue is that when we performed de novo clustering using the draft assembly with default settings for two species, including the budgerigar, we got only ca 1,500 contigs, even though we know from an independent Minimap2 mapping against the genome that there are more than 6,000 genes/transcripts with over 30 reads. With the reference-based assembly for the budgerigar we got more contigs (around 12,000), but this approach is not applicable to the other species, since their genomes are not available.

We were also thinking about using isONclust for clustering, but we did not know how to continue with the output of this program. Is there any way to transform the identified clusters from the .csv files into FASTA sequences, and how can they be further processed (assessing error rate, polishing, etc.)?

I would be very grateful for your recommendations.

@cjw85, thank you very much for putting me in touch with your colleagues. Could you or your colleagues please also have a look at the BAM and FASTQ files from my previous comments, in case there is anything wrong with them that makes the draft assembly via Pomoxis always produce empty files?

ksahlin commented 4 years ago

As for the output of isONclust: you can get separate fastq files for all the reads by running isONclust write_fastq after the clustering (see the docs).
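A hedged example based on the isONclust documentation; the paths are placeholders and the name of the cluster file (final_clusters.tsv vs .csv) depends on the isONclust version:

# cluster the ONT transcript reads
isONclust --ont --fastq reads.fastq --outfolder isonclust_out

# write one FASTQ per cluster (here: every cluster with at least 1 read)
isONclust write_fastq --clusters isonclust_out/final_clusters.tsv --fastq reads.fastq --outfolder isonclust_out/fastq_clusters --N 1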

As for forming references of each cluster: producing high-quality transcriptomes from ONT data without references is challenging, but there might be a tool for doing this soon. In the meantime, there is an experimental parameter in isONclust (--consensus). This parameter aims to produce one consensus sequence per cluster and also filters reverse complements. It was implemented for clustering some types of targeted genomic data, not for transcriptomic data, so if you want to try it, use it with caution. There are at least two limitations for transcriptomic data that I can think of off the top of my head: the per-cluster consensus may be incomplete or augmented, and reads coming from different isoforms will be collapsed into a single sequence.

I should note that the parameter is available in the newest version of isONclust (0.0.6) and needs spoa to be installed and available from the command line. (There are plans to use medaka to further polish the spoa consensus.)
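As a rough illustration of the spoa step when run by hand on a single cluster (file names are placeholders; result mode 0 asks spoa for a consensus sequence):

# build a consensus from the reads of one cluster
spoa -r 0 isonclust_out/fastq_clusters/0.fastq > cluster0_consensus.fasta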

bsipos commented 4 years ago

@MartinTes You can use the pinfish pipeline to obtain polished transcripts and a GFF annotation for the datasets for which you have a reference. I think there is a good chance it will also work for data mapped to a reference from a species closely related to the actual one.
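A hedged outline of the pinfish workflow for such a reference-based run; the exact options and thresholds should be taken from the pinfish README / pipeline-pinfish-analysis, so the flags below are assumptions:

# spliced mapping of the reads to the (related) reference genome
minimap2 -ax splice -uf -k14 reference.fasta reads.fastq | samtools sort -o aln.sorted.bam -
samtools index aln.sorted.bam

# convert alignments to GFF, cluster them, and polish one consensus per cluster
spliced_bam2gff -M aln.sorted.bam > raw_transcripts.gff
cluster_gff -c 3 -a cluster_memberships.tsv raw_transcripts.gff > clustered_transcripts.gff
polish_clusters -a cluster_memberships.tsv -c 3 -o polished_transcripts.fasta aln.sorted.bam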

Trying to use medaka out of the box for assembling transcripts is unlikely to work as the canu assembler is not good at handling short contigs by default.

MartinTes commented 4 years ago

Hi @bsipos, thank you very much for your suggestion. We will definitely explore the pinfish pipeline further, and I will let you know about the results. I think birds have quite good synteny in their genomes, so reference-based assembly could also work for the other species for which we do not have genomes.

I noticed that both the Pomoxis pipeline and pinfish employ Racon for polishing sequences, and I have a question: do I need a reference genome for this polishing step? If I use the pinfish pipeline for a species with only the genome of another closely related species available, could that introduce additional errors into our data during polishing? I apologize for the question, but I am not sure how Racon works within these pipelines.

Thank you very much.

MartinTes commented 4 years ago

Hi @ksahlin, thank you very much. When I looked closely at my budgerigar reference-based assembly processed by Pomoxis (default model for Medaka polishing), it turned out that some immune genes, not only those with low coverage but also some with higher coverage, had even worse quality after polishing than before, which is quite unexpected... so I will definitely have to explore more possibilities.

If I understood correctly, I could try de novo transcriptome clustering using isONclust, then convert the output clusters to FASTQ/FASTA files, use spoa to build a consensus sequence per cluster, and then use Medaka independently for polishing, with some preliminary Racon steps?
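For illustration, a hedged per-cluster version of that idea, continuing from the isONclust/spoa sketches above (all file names are placeholders):

# map the cluster's reads to its spoa consensus and polish once with racon
minimap2 -x map-ont cluster0_consensus.fasta isonclust_out/fastq_clusters/0.fastq > cluster0.paf
racon isonclust_out/fastq_clusters/0.fastq cluster0.paf cluster0_consensus.fasta > cluster0_racon.fasta

# final polish of the racon consensus with medaka
medaka_consensus -i isonclust_out/fastq_clusters/0.fastq -d cluster0_racon.fasta -o cluster0_medaka -t $(nproc)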

Many thanks in advance.

ksahlin commented 4 years ago

Hi @MartinTes,

I don't know how you polished the transcript reads, but if you used any tool designed for genomic sequences, low quality is to be expected because of exon (and other) differences. You will likely get chimeric reads.

As for clustering the reads, isONclust should do a decent job. However, for the downstream "polishing" steps, keep in mind the (significant) caveats listed in my previous message: polishing the augmented (possibly incomplete) gene consensus constructed by spoa with transcriptomic reads coming from different isoforms does not sound promising, and I would guess it is prone to errors.

MartinTes commented 4 years ago

Hi @ksahlin, yes, that is a good point about polishing. Could one way to partially overcome this issue be to preselect the raw reads by mapping them to the longest transcript per gene from the genome, and then take only those reads as the input?

bsipos commented 4 years ago

@MartinTes @ksahlin If you use pinfish for clustering, then there should be no exon differences between the reads that are polished together. Moreover, since the reference is only used during clustering, the consensuses will represent the actual species. When using pinfish, preselecting reads is not necessary, and one would obtain polished transcripts for all isoforms, not just the longest one.

ksahlin commented 4 years ago

Yes indeed, for any kind of reference-based analysis, go with pinfish. My comments only concern a genome-reference-free analysis.

MartinTes commented 4 years ago

@ksahlin @bsipos OK, thank you very much for your advice. I am going to run pinfish and hope to have some results soon.