dluecke closed this issue 5 months ago
See issue #2 .
The current EGAPx RNA-seq pipeline is designed for short-read data. We will support Iso-Seq data in a future release.
Eric
I see, thank you for your quick response. I look forward to Iso-Seq support in future versions. For this EGAPx version I'm inclined to find and translate ORFs in the Iso-Seq transcripts, then pass along those protein sequences under the 'protein' YAML field. Is this in line with the current version's capabilities?
thanks again, David
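For readers considering the same workaround, here is a minimal sketch of the "find and translate ORFs" step in plain Python. It assumes the standard genetic code, ATG start codons, and transcripts already oriented 5' to 3' (pass `both_strands=True` otherwise). The function names and the 100-residue default cutoff are illustrative choices, not part of EGAPx.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate codon by codon; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def longest_orf(seq, min_aa=100, both_strands=False):
    """Return the longest complete ORF (ATG to stop) as protein, or None.

    min_aa is a hypothetical length cutoff; tune it for your data.
    """
    seq = seq.upper().replace("U", "T")
    strands = [seq, reverse_complement(seq)] if both_strands else [seq]
    best = ""
    for strand in strands:
        for frame in range(3):
            protein = translate(strand[frame:])
            # Splitting on '*' gives stop-delimited stretches; the last one
            # has no terminating stop codon, so it is skipped as incomplete.
            for segment in protein.split("*")[:-1]:
                start = segment.find("M")
                if start != -1 and len(segment) - start > len(best):
                    best = segment[start:]
    return best if len(best) >= min_aa else None
```

Keeping only one ORF per transcript, and only complete ones, is a deliberately conservative choice: every sequence passed on would be treated as protein evidence.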
Can you? Yes.
Should you? Not sure.
If you set the tax-id and omit the protein argument, EGAPx automatically retrieves a curated protein set. You could figure out which set is retrieved, download it from https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/target_proteins/, and add the translated ORFs to it. But I do think that supplying short-read RNA-seq data from your sample or from SRA will improve performance. Relying on translated ORFs as the only sample-specific evidence might result in accuracy issues. It would be up to you to assess the performance of the approach.
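The "download the curated set and add the translated ORFs" step could be sketched roughly as follows. The helper names and file paths are hypothetical, and the curated file is assumed to be already downloaded and uncompressed. The sketch only concatenates the two FASTA files and drops ORF proteins whose sequence duplicates one already written; it says nothing about whether the merged set improves the annotation.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line and header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_protein_sets(curated_path, orf_path, out_path, width=60):
    """Write curated proteins first, then ORF proteins not already present.

    Returns the number of records written. All paths are hypothetical
    local files chosen by the user.
    """
    seen = set()
    written = 0
    with open(out_path, "w") as out:
        for path in (curated_path, orf_path):
            for header, seq in read_fasta(path):
                seq = seq.rstrip("*").upper()  # drop a trailing stop symbol
                if not seq or seq in seen:
                    continue
                seen.add(seq)
                out.write(f">{header}\n")
                for i in range(0, len(seq), width):
                    out.write(seq[i:i + width] + "\n")
                written += 1
    return written
```

Writing the curated records first means that when an ORF translation is identical to a curated protein, the curated header is the one that survives.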
OK this is great advice, thank you very much!
FWIW my colleague here says "Not recommended." The gene prediction performed by Gnomon in EGAPx places a lot of weight on protein evidence, so including bogus proteins would be detrimental.
We hope to have direct support for Iso-Seq in a few months.
Hello, we are hoping to use this pipeline with RNA sequencing data from the PacBio Iso-Seq platform. I am using an egapx module on a SLURM-scheduled cluster, and have confirmed it works with the provided example files. I then tried putting the Iso-Seq FASTA file paths in the 'reads' section of the input YAML file but ran into the following error:
Line 83 of this script looks to be processing read pair files, which wouldn't apply to Iso-Seq data. I looked but didn't see a 'transcripts' or similar option as a YAML header. Is there an option for using this RNA-seq data type in the current version of egapx, or some other workaround for this issue? Thank you!