ncbi / egapx

Eukaryotic Genome Annotation Pipeline-External caller scripts and documentation

Method for using PacBio Iso-Seq reads? #13

Closed dluecke closed 5 months ago

dluecke commented 5 months ago

Hello, we are hoping to use this pipeline with RNA sequencing data from the PacBio Iso-Seq platform. I am using an egapx module on a SLURM-scheduled cluster and have confirmed that it works with the provided example files. I then tried putting the Iso-Seq FASTA file paths in the 'reads' section of the input YAML file but ran into the following error:

ERROR ~ index is out of range 0..-1 (index = 0)

 -- Check script '/software/el9/apps/egapx/0.1.2-alpha/nf/./subworkflows/ncbi/./rnaseq_short/star_wnode/main.nf' at line: 83 or see '/90daydata/vpgru/DavidLuecke/egapx_testing/Cmac_01/Cmac_01/nextflow.log' file for more details

Line 83 of this script appears to process read pair files, which wouldn't apply to Iso-Seq data. I looked but didn't see a 'transcripts' or similar option among the YAML keys. Is there an option for using this RNA-seq data type in the current version of egapx, or some other workaround for this issue? Thank you!

etvedte commented 5 months ago

See issue #2.

The current EGAPx RNA-seq pipeline is designed for short-read data. We will support Iso-Seq data in a future release.

Eric

dluecke commented 5 months ago

I see, thank you for your quick response. I look forward to Iso-Seq support in future versions. For this EGAPx version I'm inclined to find and translate ORFs in the Iso-Seq transcripts, then pass those protein sequences along under the 'protein' YAML field. Is this in line with the current version's capabilities?

thanks again, David
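
For readers wondering what "find and translate ORFs" might look like in practice, here is a minimal sketch in plain Python. It is not part of EGAPx and is not the method anyone in this thread used; the function names, the six-frame search, the ATG-to-stop ORF definition, and the `min_aa` length cutoff are all assumptions made for illustration. A dedicated ORF caller would likely do better on real Iso-Seq data.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_orf_protein(seq, min_aa=100):
    """Translate the longest complete ORF (ATG through stop) in six frames.

    Returns the protein without the stop symbol, or None if no complete
    ORF reaches min_aa residues. min_aa is a hypothetical cutoff.
    """
    seq = seq.upper()
    best = ""
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            protein, in_orf = [], False
            for i in range(frame, len(strand) - 2, 3):
                # Codons containing N or other symbols become X.
                aa = CODON_TABLE.get(strand[i:i + 3], "X")
                if not in_orf:
                    if aa == "M":
                        protein, in_orf = ["M"], True
                elif aa == "*":
                    # Complete ORF: keep it if it is the longest so far.
                    if len(protein) > len(best):
                        best = "".join(protein)
                    in_orf = False
                else:
                    protein.append(aa)
    return best if len(best) >= min_aa else None
```

ORFs that run off the end of a transcript without a stop codon are discarded here, which errs on the side of supplying fewer but complete proteins.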

etvedte commented 5 months ago

Can you? Yes.

Should you? Not sure.

EGAPx will automatically retrieve curated protein sets if you set tax-id without a protein argument. You could figure out which set is retrieved, download it from https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/target_proteins/, and add the translated ORFs to it. But I do think that supplying short-read RNA-seq data from your sample or from SRA will improve performance. Relying on translated ORFs alone as protein data might result in accuracy issues. It would be up to you to assess the performance of the approach.
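
The "download the curated set and add the translated ORFs" step could be sketched as follows. This is an illustrative helper, not an EGAPx utility: the function names and the choice to drop ORF proteins whose sequence exactly matches an earlier entry are assumptions, and any file names you pass in are your own.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_protein_sets(curated, extra):
    """Concatenate two protein sets, keeping the first copy of each sequence.

    Curated records come first, so a translated ORF identical to a curated
    protein is dropped. Trailing stop symbols (*) are removed.
    """
    seen, merged = set(), []
    for header, seq in list(curated) + list(extra):
        key = seq.rstrip("*").upper()
        if key and key not in seen:
            seen.add(key)
            merged.append((header, key))
    return merged


def write_fasta(records, handle, width=60):
    """Write (header, sequence) records to an open text handle."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")
```

In use, you would open the downloaded curated file and your translated-ORF file, pass each through `read_fasta`, and write the result of `merge_protein_sets` to a new file that the 'protein' field then points to. Whether the merged set helps or hurts annotation quality is something you would need to evaluate yourself.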

dluecke commented 5 months ago

OK this is great advice, thank you very much!

etvedte commented 5 months ago

FWIW my colleague here says "Not recommended." The gene prediction performed by Gnomon in EGAPx places a lot of weight on protein evidence, so including bogus proteins would be detrimental.

We hope to have direct support for Iso-Seq in a few months.