P.evermanni ShortStack problem - IndexError: list index out of range during "Screening of possible de novo microRNAs"

kubu4 commented 3 months ago

I've encountered the above error when using ShortStack. I've already described the issue on the developer's repo, but haven't gotten a response yet. If anyone is willing to glance at the details shown in that issue, I'd greatly appreciate it.

The truncated version of the error is:

Screening of possible de novo microRNAs
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/sam/programs/mambaforge/envs/ShortStack-4.0.3_env/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/sam/programs/mambaforge/envs/ShortStack-4.0.3_env/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/home/sam/programs/mambaforge/envs/ShortStack-4.0.3_env/bin/ShortStack", line 2391, in mir_analysis
    new_locus, s_start, s_end = analyze_fold(mir_locus, dotbracket, bedfields, merged_bam, args)
  File "/home/sam/programs/mambaforge/envs/ShortStack-4.0.3_env/bin/ShortStack", line 2954, in analyze_fold
    new_dotbracket = foldlines[2].rstrip().split(' ')[0]
IndexError: list index out of range
"""
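
For anyone reading along: the last frame indexes the third element of `foldlines`, so the error would show up whenever that list has fewer than three entries. Here is a minimal sketch of that failure mode, assuming `foldlines` is a list of text lines. The helper name `third_line_structure` and the sample inputs are hypothetical; this is neither ShortStack's code nor the developer's fix.

```python
from typing import List, Optional


def third_line_structure(foldlines: List[str]) -> Optional[str]:
    """Return the first space-separated token of the third line, or None.

    Uses the same expression as the traceback, but checks the list length
    first instead of raising IndexError on short input.
    """
    if len(foldlines) < 3:
        return None
    return foldlines[2].rstrip().split(' ')[0]


# Hypothetical inputs: one complete three-line record and one truncated record.
complete = [">locus_1\n", "GGGAAACCC\n", "(((...))) (-1.20)\n"]
truncated = [">locus_2\n"]

print(third_line_structure(complete))
print(third_line_structure(truncated))
```

A guard like this only hides the symptom. The useful question is why the list comes back short for particular loci.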

The odd thing is this error is only occurring using the P.evermanni genome and only when using R1-only reads or merged reads.

ShortStack runs fine using unmerged R1 and R2 reads.

I'd also like to add that ShortStack works without any issues, regardless of input reads (R1/R2, R1 only, merged), on the two other species we've looked at.

If anyone has any suggestions as to how to approach this, I'd greatly appreciate it.

Maybe I'll try each individual FastQ file, one at a time, and see if there's some problematic read(s) in one of them?
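
A rough sketch of what the one-file-at-a-time approach could look like. The flags `--genomefile`, `--readfile`, `--outdir`, and `--dn_mirna` are written as I understand the ShortStack 4 command line, so check them against `ShortStack --help`. The function names and file paths are placeholders, not real files from this project.

```python
import subprocess
from pathlib import Path
from typing import Dict, List, Optional


def build_cmd(genome: str, readfile: str, outdir: str, dn_mirna: bool = True) -> List[str]:
    """Assemble a ShortStack command for a single read file."""
    cmd = ["ShortStack", "--genomefile", genome, "--readfile", readfile, "--outdir", outdir]
    if dn_mirna:
        cmd.append("--dn_mirna")
    return cmd


def run_each(genome: str, readfiles: List[str], out_root: str,
             dry_run: bool = True) -> Dict[str, Optional[int]]:
    """Run (or, with dry_run, just print) one ShortStack command per read file.

    Returns a mapping of read file -> exit code (None when dry_run is True).
    """
    results: Dict[str, Optional[int]] = {}
    for readfile in readfiles:
        # One output directory per sample, named after the file's leading stem.
        sample = Path(readfile).name.split(".")[0]
        cmd = build_cmd(genome, readfile, str(Path(out_root) / sample))
        if dry_run:
            print(" ".join(cmd))
            results[readfile] = None
        else:
            results[readfile] = subprocess.run(cmd).returncode
    return results


# Placeholder paths for illustration only.
plan = run_each("genome.fa", ["sample_A.fq.gz", "sample_B.fq.gz"], "shortstack_out")
```

With `dry_run=False`, a non-zero exit code in the returned mapping would point to the file(s) that trigger the error.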

sr320 commented 2 months ago

Assuming a single file still fails, can you please post a link to one single file that fails and one single file from another species that works? Must be in fastq formatting?

kubu4 commented 2 months ago

These worked with P.evermanni and ShortStack previously:

Paired-end reads

The following do NOT work with P.evermanni and ShortStack:

R1 only reads

https://gannet.fish.washington.edu/Atumefaciens/gitrepos/deep-dive/E-Peve/output/06.1-Peve-sRNAseq-trimming-R1-only/trimmed-reads/sRNA-POR-73-S1-TP2_R1_001.fastp-R1-31bp-auto_adapters-polyG.fq.gz

Merged reads

https://gannet.fish.washington.edu/Atumefaciens/gitrepos/deep-dive/E-Peve/output/06.2-Peve-sRNAseq-trimming-31bp-fastp-merged/trimmed-reads/POR-73-S1-TP2-fastp-adapters-polyG-31bp-merged.fq.gz

The following work with P.meandrina and ShortStack:

R1 only reads

https://gannet.fish.washington.edu/Atumefaciens/gitrepos/deep-dive/F-Pmea/output/08.1-Pmea-sRNAseq-trimming-R1-only/trimmed-reads/sRNA-POC-47-S1-TP2_R1_001.fastp-R1-31bp-auto_adapters-polyG.fq.gz

Merged reads

https://gannet.fish.washington.edu/Atumefaciens/gitrepos/deep-dive/F-Pmea/output/08.2-Pmea-sRNAseq-trimming-31bp-fastp-merged/trimmed-reads/sRNA-POC-47-S1-TP2-fastp-adapters-polyG-31bp-merged.fq.gz

sr320 commented 2 months ago

Have you considered going back with evermanni from raw -> merged?

kubu4 commented 2 months ago

I don't follow.

Skip trimming?

sr320 commented 2 months ago

No - just doing everything again - pretending you just got the raw data.

Inspected fastq files and did not notice any difference- though multiQC still worth a shot.

The main differences between the two FASTQ files you've provided are:

Index Sequence: Each read entry has a tag that indicates which sample or experiment it belongs to. In the first file, the reads are tagged with the index GGTAGCAT, while in the second file, the index used is GTGGCCAT.
Read Length Variability: Both files show variability in read lengths, but they handle this variability differently. For example, while both files contain reads of varying lengths, the specific lengths and how frequently they occur differ between the files.
Quality Score Variations: The FASTQ format includes a quality score for each base, providing information about the probability of an incorrect base call. The two files show some variations in these quality scores (denoted by characters such as F, :, #, etc.). This could suggest differences in sequencing quality or different sequencing technology settings or calibration.

These differences can affect downstream analysis, such as alignment and quantification, and should be considered during data preprocessing and analysis.
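
The read-length point can be checked directly by tabulating lengths per file. A minimal sketch, assuming standard four-line FASTQ records with no wrapped sequence lines. The function names are illustrative, and the example records are made up.

```python
import gzip
from collections import Counter
from typing import Iterable


def read_length_counts(lines: Iterable[str]) -> Counter:
    """Count read lengths from FASTQ lines (sequence is line 2 of every 4)."""
    counts: Counter = Counter()
    for i, line in enumerate(lines):
        if i % 4 == 1:
            counts[len(line.strip())] += 1
    return counts


def read_length_counts_from_file(path: str) -> Counter:
    """Same as above, reading from a plain or gzipped FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return read_length_counts(handle)


# Made-up records for illustration.
example = ["@r1", "ACGT", "+", "FFFF", "@r2", "ACGTAC", "+", "FFFFFF"]
lengths = read_length_counts(example)
print(sorted(lengths.items()))
```

Comparing the resulting tables for a failing and a working file would show whether the length distributions really differ in a way that matters.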

kubu4 commented 2 months ago

No - just doing everything again - pretending you just got the raw data.

Gotcha.

multiQC still worth a shot.

What do you mean by this?

Pretty sure there's FastQC/MultiQC reports for all FastQs, at all stages.

sr320 commented 2 months ago

Did re-wrangling from raw data impact anything?

kubu4 commented 2 months ago

Nope.

I've re-trimmed, as well as modified trimming parameters (mostly final read length), and it hasn't made a difference.

I've added a bunch of print statements to the source code to try to see precisely where things are going awry. This has sort of been useful, but not really. I know which variable(s) aren't populating, leading to the error, but it's a slog to work my way backwards through the code to try to figure out which input file(s) are being processed, so that I might be able to look through those.

Admittedly, I'm getting a bit burnt out on trying to troubleshoot this. It's very tedious and time-consuming. Each re-run takes about 20 minutes to reach the error.

sr320 commented 2 months ago

I am going to try a few things - if that does not work we will move to just R1 - will send you files to try today.

kubu4 commented 2 months ago

R1 only doesn't work with P.evermanni, either...

sr320 commented 2 months ago

First in series of files to try - Groomed 47 merged https://usegalaxy.org/api/datasets/f9cad7b01a47213562fe7b27a2369930/display?to_ext=fastqsanger

sr320 commented 2 months ago

have we considered evermanni genome is the problem? - run evermanni reads on different genome.

kubu4 commented 2 months ago

have we considered evermanni genome is the problem?

Yes, definitely considered. However, we CAN run ShortStack with P.evermanni with the "original" trimming params. I even re-ran it this week to confirm that it (still) works.

run evermanni reads on different genome.

Sure, I'll do this!

Will shortstack work on fasta?

Just glanced at documentation, and yes, FastA formatted reads are accepted as inputs.
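
If a FastA version of the reads is needed, the conversion is simple enough to do locally. A minimal sketch, assuming four-line FASTQ records; `fastq_to_fasta` is an illustrative name, not part of ShortStack or any library.

```python
from typing import Iterable, Iterator


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines (header, sequence) from four-line FASTQ records."""
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 == 0:
            if not line.startswith("@"):
                raise ValueError(f"Expected FASTQ header at line {i + 1}: {line!r}")
            # Swap the FASTQ '@' for the FASTA '>' and keep the rest of the header.
            yield ">" + line[1:]
        elif i % 4 == 1:
            yield line
        # Lines 3 and 4 of each record ('+' separator and qualities) are dropped.


# Made-up record for illustration.
record = ["@read_1 extra\n", "ACGTACGT\n", "+\n", "FFFFFFFF\n"]
fasta_lines = list(fastq_to_fasta(record))
print("\n".join(fasta_lines))
```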

kubu4 commented 2 months ago

Groomed 47 merged

Was this done with trimmed reads?

kubu4 commented 2 months ago

Groomed 47 merged

Was this done with trimmed reads?

Eh! Never mind. Just looked at FastQ and see that they're trimmed to 25bp?

sr320 commented 2 months ago

Here is a fasta of 73 based on your merged fastq - https://usegalaxy.org/api/datasets/f9cad7b01a472135fe471ba2ddeb7983/display?to_ext=fasta.gz

give that a try

sr320 commented 2 months ago

Here is a Groomed 73 merged to try - https://usegalaxy.org/api/datasets/f9cad7b01a472135de52bdc05fda83a2/display?to_ext=fastqsanger

sr320 commented 2 months ago

Here is new interlaced (merged) 73 https://usegalaxy.org/api/datasets/f9cad7b01a472135fc6850d35af09e48/display?to_ext=fastqsanger.gz

kubu4 commented 2 months ago

First in series of files to try - Groomed 47 merged https://usegalaxy.org/api/datasets/f9cad7b01a47213562fe7b27a2369930/display?to_ext=fastqsanger

Completed successfully.

Original trim length was 25bp, which also had run successfully. So, maybe something there?

I tried a 30bp trim yesterday, which failed...

sr320 commented 2 months ago

Here is new joined (merged) 73 https://usegalaxy.org/api/datasets/f9cad7b01a472135efc8e5f9433aa51b/display?to_ext=fastqsanger.gz

kubu4 commented 2 months ago

Here is a fasta of 73 based on your merged fastq - https://usegalaxy.org/api/datasets/f9cad7b01a472135fe471ba2ddeb7983/display?to_ext=fasta.gz

give that a try

This one failed with same error.

kubu4 commented 2 months ago

@sr320 successfully ran one of the original fastp 31bp merged reads (I believe the 73 sample) on his laptop via command line. Additionally, the developer responded to my issue (GitHub issue) and was successful in running all three samples on his computer, via command line.

So, I'll give the command line a rip on Raven and see how it goes. If that fails, I'll run it on my laptop and see how it goes. And, if that all fails, we know @sr320 can run on his computer, if needed.

kubu4 commented 2 months ago

Unbelievably, this ran successfully on raven, via the command line!

So weird...

kubu4 commented 2 months ago

Gah! Spoke too soon!!!

The command @sr320 (as well as the developer) ran omitted an option I had been using, --dn_mirna, for de novo miRNA prediction. Once I add that back in, the command fails... :cry:

EDITED: Fixed option.

sr320 commented 2 months ago

And if we leave out?

kubu4 commented 2 months ago

Then, the results aren't comparable to the other two species we've run?

sr320 commented 2 months ago

Go ahead and try other files above I made and also with different genome if not already done.

kubu4 commented 2 months ago

Amazingly, the developer found the bug and has fixed it! (GitHub Issue)

He's indicated he'll put it in the next release, which will be "soon." Not sure how long that actually means.

I'll glance at the fix and see if I can incorporate the changes myself.

kubu4 commented 2 months ago

Alrighty, I implemented the changes mentioned in the developer's comment and have successfully run ShortStack on P.evermanni!

urol-e5 / deep-dive