ncbi / egapx

Eukaryotic Genome Annotation Pipeline-External caller scripts and documentation

RNA-seq & Array issue #10

Closed znelson999 closed 1 month ago

znelson999 commented 1 month ago

Hello,

I'm attempting to include RNA-seq data in my run and am getting an array error. The sequences are hosted locally and are not in NCBI. Included are the error I'm getting and the yaml file being used.

yamlFileUsed.txt

EGAPissue.txt

The wording used in the documentation says that RNA-seq files have to end in 1 or 2; could that be causing my problems? If so, is there a workaround to be able to use data that is not on NCBI? Or could it just be an error in the way I've set up the required files?

Thank you for any assistance you can provide.

pstrope commented 1 month ago

Hi, Thank you for reporting.

Can you please post the command you used?

thanks, Pooja

znelson999 commented 1 month ago

Here you are

module load python_3/3.11.1

python -m venv /project/cricket_gen/ZachN/Virtual/EGAP
source /project/cricket_gen/ZachN/Virtual/EGAP/bin/activate
pip install -r /project/cricket_gen/ZachN/EGAPx/egapx-main/ui/requirements.txt

module load nextflow/23.04.3
module load singularityCE/3.11.4

python3 /project/cricket_gen/ZachN/EGAPx/egapx-main/ui/egapx.py /project/cricket_gen/ZachN/EGAPx/egapx-main/examples/input_D_farinae_small.yaml

python3 /project/cricket_gen/ZachN/EGAPx/egapx-main/ui/egapx.py /project/cricket_gen/ZachN/EGAPx/egapx-main/yamlFiles/oldTenebrioMol.yaml -e singularity -w /project/cricket_gen/ZachN/Virtual/WorkingDirectory -o /project/cricket_gen/ZachN/Virtual/EGAPOutputEditedTenebrioMol

The above is submitted via a .sh script.

pstrope commented 1 month ago

Currently, the read files have to end in .1 and .2 to be paired up. In the future, we will make this more flexible.

To be able to read the gz read files, edit ui/assets/default_task_params.yaml: under star_wnode: there is a -star-params option. To its argument list, add --readFilesCommand zcat

pstrope commented 1 month ago

Also, for your first egapx run, use: python3 /project/cricket_gen/ZachN/EGAPx/egapx-main/ui/egapx.py /project/cricket_gen/ZachN/EGAPx/egapx-main/examples/input_D_farinae_small.yaml -o outdir

znelson999 commented 1 month ago

Hello,

I have modified the default_task_params.yaml file but am still getting an index issue. The error I'm getting is the same as before:

ERROR ~ index is out of range 0..-1 (index = 0)

Any other things I can try to fix this?

pstrope commented 1 month ago

Hi, The best thing to do is to wait for our next version update. We are actively working on the issue that you brought up. Currently, the pipeline also has trouble reading gz read files.

We appreciate your testing and reporting. We'll reach out when the fix is ready.

Pooja

znelson999 commented 1 month ago

Very well, thank you for your assistance.

victzh commented 1 month ago

@znelson999 can you supply me with a short sample of the FASTQ files you use, say the first 100 lines of the first 4 files in your list (after zcat'ting them, of course)? You could use something like

zcat /project/cricket_gen/ZachN/EGAPx/TenebrioMolTranscriptome/Tmol_RNA_transcriptome_lifestages/t3_ll_1_R1.fastq.gz | head -100 > t3_ll_1_R1.fastq.sample

I want to ensure that our code works with the output from a real sequencing machine (which I assume it is) vs. processed reads from SRA.
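For anyone who prefers to produce such samples from Python rather than with zcat and head, here is a minimal sketch using only the standard library. The function name write_fastq_sample and the file names in the usage note are hypothetical; it assumes the input is a gzip-compressed text FASTQ file.

```python
import gzip
from itertools import islice


def write_fastq_sample(src_gz, dest, n_lines=100):
    """Copy the first n_lines of a gzipped text file to dest.

    Returns the number of lines actually written, which is smaller
    than n_lines when the input is shorter than that.
    """
    written = 0
    with gzip.open(src_gz, "rt") as src, open(dest, "w") as out:
        # islice stops after n_lines without reading the rest of the file.
        for line in islice(src, n_lines):
            out.write(line)
            written += 1
    return written
```

Calling write_fastq_sample("t3_ll_1_R1.fastq.gz", "t3_ll_1_R1.fastq.sample") would keep the first 100 lines, which is 25 records if each FASTQ record spans four lines.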

znelson999 commented 1 month ago

Hi @victzh

Attached is a zip file with the portions you requested, let me know if this helps.

fastqsamples.zip

victzh commented 1 month ago

@znelson999 thanks! It seems to me that t3_ll_1_R1.fastq.gz and t3_ll_1_R2.fastq.gz are parts of a paired run. Why, then, do the samples match each other exactly? Shouldn't they be different ends of the same piece of RNA? I'm not a biologist, I'm a programmer, so pardon my ignorance.
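A quick way to catch this kind of mix-up before sharing samples is to compare the two mate files directly. The sketch below is only illustrative: compare_mate_samples is a hypothetical helper, and it assumes uncompressed FASTQ samples with four lines per record.

```python
def compare_mate_samples(r1_path, r2_path):
    """Report whether two FASTQ samples are identical and equally long."""
    with open(r1_path) as f1, open(r2_path) as f2:
        r1 = f1.read().splitlines()
        r2 = f2.read().splitlines()
    # With four-line records, the sequence is the second line of each record.
    seqs1 = r1[1::4]
    seqs2 = r2[1::4]
    return {
        "same_record_count": len(seqs1) == len(seqs2),
        "identical": r1 == r2,
    }
```

For a genuine pair one would expect same_record_count to be True and identical to be False; identical being True suggests the same file was sampled twice.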

znelson999 commented 1 month ago

@victzh

Thanks for pointing this out. I made a mistake when making those sample files. Attached are the appropriate zipped files. RealFastqFiles.zip

victzh commented 1 month ago

@znelson999 thanks, it helps a lot. I tried to run a newer version with your data, but it failed (so far) because the samples are too short. We need to fix this and will see what else fails.

zilov commented 1 month ago

Hello! I'm encountering the same "index is out of range" error while running egapx with local RNA-Seq data. I've tried renaming my unzipped files to {prefix}.1, {prefix}.2, {prefix}.1.fastq, and {prefix}.2.fastq, but the error persists.

Here's the error message:

ERROR ~ index is out of range 0..-1 (index = 0)

 -- Check script 'nf/./subworkflows/ncbi/./rnaseq_short/star_wnode/main.nf' at line: 83 or see '/media/eternus1/data/vgp/glis_glis/users/zilov/annotation/egapx/egapx/glis_out_last/nextflow.log' file for more details

I installed egapx following the README instructions and am running it within the nextflow conda environment. Test run using the example data completed successfully.

Is there a way to troubleshoot this now? Should I wait for the next version of the tool, or is the problem likely on my side?

Input file: input_yaml2.txt

pstrope commented 1 month ago

Hi @zilov It looks like the underscores in the filenames are causing the problem. We are working on this. For now, you could remove the underscores and give it a try. Pooja
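As a rough illustration of this workaround combined with the .1/.2 suffix rule mentioned earlier in the thread, the sketch below only computes new names and does not touch any files. Everything in it is an assumption made for illustration: the helper names, the input pattern (names such as sample_R1.fastq or sample_2.fq), and the output form {prefix}.1 / {prefix}.2; the exact names EGAPx accepts are not confirmed here. It targets already decompressed files.

```python
import re

# Hypothetical pattern: a stem, then "_" or ".", an optional "R", the mate
# number 1 or 2, and an optional .fastq/.fq extension.
MATE_PATTERN = re.compile(r"^(?P<stem>.+?)[._]R?(?P<mate>[12])(?:\.(?:fastq|fq))?$")


def egapx_friendly_name(filename):
    """Return a name without underscores that ends in .1 or .2."""
    match = MATE_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"cannot find a mate number in {filename!r}")
    stem = match.group("stem").replace("_", "")
    return f"{stem}.{match.group('mate')}"


def plan_renames(filenames):
    """Map each input name to its new name, refusing ambiguous plans."""
    plan = {name: egapx_friendly_name(name) for name in filenames}
    # Dropping underscores can make two different inputs collide.
    if len(set(plan.values())) != len(plan):
        raise ValueError("two inputs would map to the same output name")
    return plan
```

The resulting mapping can be reviewed by eye and then applied with symlinks or copies, so the original read files stay untouched.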

zilov commented 1 month ago

Thank you, that works!