langmead-lab / monorail-external

examples to run monorail externally
MIT License

waited too long error #30

Closed pabloacera closed 2 months ago

pabloacera commented 2 months ago

Hi, I am running:

sudo -E /bin/bash singularity/run_recount_pump.sh quay.io/benlangmead/recount-rs5 $SRR_ID local hg38 10 /mnt/data/paceramateos/monorail-external-master /mnt/data/paceramateos/monorail-external-master/SRA_input/$SRR_ID/${SRR_ID}_1.fastq /mnt/data/paceramateos/monorail-external-master/SRA_input/$SRR_ID/${SRR_ID}_2.fastq $SRR_ID

I have run this command successfully for many other samples, but I have never seen this error before and am not sure what to do about it. The results are not in the output folder, and I see the following error in output/$SRR_ID_att0/'$SRR_ID!$SRR_ID!hg38!local.align.log':

Apr 10 22:41:23 ..... started STAR run
Apr 10 22:41:23 ..... loading genome

EXITING because of FATAL ERROR: waited too long for the other job to finish loading the genomeSuccess
SOLUTION: remove the shared memory chunk by running STAR with --genomeLoad Remove, and restart STAR
Apr 11 00:21:23 ...... FATAL ERROR, exiting

Any help would be appreciated!!! thanks!

ChristopherWilks commented 2 months ago

Hi @pabloacera,

My informed guess from the error message is that residual parts of the genome index are still sitting in shared memory.

Monorail leverages STAR's ability to share a single instance of its genome index using the shared memory scheme in Unix/Linux. This is useful if you run multiple samples (multiple monorail jobs) in parallel on the same machine, since it avoids needing enough memory to load a new instance of the ~26GB human genome index for every job.

On the machine you're running on (but not inside the Singularity container), you can check whether segments of STAR's index are still resident in shared memory by running ipcs. The output will look something like this:

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x1706000a 7          user       666        1          0
0x17060009 8          user       666        28945729651 0

------ Semaphore Arrays --------
key        semid      owner      perms      nsems

You can remove them with these commands:

sudo ipcrm --shmem-key 0x1706000a
sudo ipcrm --shmem-key 0x17060009

Or, if you're sure nothing else on the system is using System V IPC (shared memory, message queues, or semaphores), you can just purge everything:

sudo ipcrm --all

I'd try that, then make sure there are no shared segments left, and then retry that sample.
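
The check-and-remove steps above could be sketched as a small shell helper. This is only a sketch under stated assumptions: the function name orphaned_shm_ids is hypothetical, it assumes the util-linux ipcs -m column layout shown above (key, shmid, owner, perms, bytes, nattch), and it treats every segment with zero attached processes as a removal candidate. That may include segments belonging to other software, so review the list before removing anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ipcs -m`-style output on stdin and print the
# shmid of every shared-memory segment that no process is attached to
# (nattch == 0). Assumed columns: key shmid owner perms bytes nattch [status].
orphaned_shm_ids() {
    # Data rows start with a hex key; header and separator lines do not.
    awk 'NF >= 6 && $1 ~ /^0x[0-9a-fA-F]+$/ && $6 == 0 { print $2 }'
}

# Usage on the host (not inside the container):
#
#   ipcs -m | orphaned_shm_ids        # 1. list candidate segment ids
#
#   ipcs -m | orphaned_shm_ids |      # 2. remove them after reviewing the list
#       while read -r id; do sudo ipcrm --shmem-id "$id"; done
#
#   ipcs -m                           # 3. confirm nothing is left, then retry
```

Removing by shmid rather than by key is a design choice here: it also works for segments whose key is 0x00000000, which cannot be addressed by key.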

pabloacera commented 2 months ago

It worked! @ChristopherWilks GOAT!

ChristopherWilks commented 2 months ago

Thanks for the update @pabloacera, glad it worked.