issue in the lastz step

lnyawen commented 10 months ago

Hi,

I am using this pipeline in local mode to create a genome alignment. Here is my command:

make_chains.py human macAss hg38.2bit macAss.2bit --project_dir human_macAss --executor_queuesize 40

It ran fine at first. However, some jobs have since started failing, with errors reported as follows:

executor >  local (1158)
[cc/0029ca] process > execute_jobs (1225) [ 59%] 1118 of 1875

executor >  local (1159)
[3f/96e7bf] process > execute_jobs (1081) [ 59%] 1119 of 1875

executor >  local (1160)
[cf/920417] process > execute_jobs (1082) [ 59%] 1119 of 1875

executor >  local (1160)
[bb/e4ed9a] process > execute_jobs (26)   [ 59%] 1120 of 1876, failed: 1, ret...
[bb/e4ed9a] NOTE: Process `execute_jobs (26)` failed -- Execution is retried (1)

executor >  local (1161)
[91/7d94e9] process > execute_jobs (1083) [ 59%] 1121 of 1876, failed: 1, ret...
[bb/e4ed9a] NOTE: Process `execute_jobs (26)` failed -- Execution is retried (1)

executor >  local (1161)
[91/7d94e9] process > execute_jobs (1083) [ 59%] 1121 of 1876, failed: 1, ret...
[bb/e4ed9a] NOTE: Process `execute_jobs (26)` failed -- Execution is retried (1)
[78/143ab2] process > execute_jobs (1085) [ 59%] 1123 of 1876, failed: 1, ret...

executor >  local (1163)
[78/143ab2] process > execute_jobs (1085) [ 59%] 1123 of 1876, failed: 1, ret...

This seems to be happening in the lastz step: the errors started appearing after the log message "************ HgStepManager: executing step 'lastz' Sun Oct 15 09:08:53 2023."

There are now 4 failed jobs:

[35/91e12f] process > execute_jobs (1394) [ 72%] 1370 of 1879, failed: 4, ret...

executor >  local (1411)
[64/039d77] process > execute_jobs (1395) [ 72%] 1371 of 1879, failed: 4, ret...

executor >  local (1411)
[64/039d77] process > execute_jobs (1395) [ 72%] 1371 of 1879, failed: 4, ret...

executor >  local (1412)
[c3/9f78eb] process > execute_jobs (1396) [ 73%] 1372 of 1879, failed: 4, ret...

What should I do to fix this?

Thank you in advance for your help.

Yawen

kirilenkobm commented 10 months ago

Hi Yawen,

I am sorry for the prolonged silence; I've been on a rather intensive business trip. It seems the issue might be related to the cluster job time limit. I'll need some more time to look into it.
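
To see which jobs failed and why, one option is to inspect the Nextflow work directory: each task runs in a two-level hashed subdirectory (the `bb/e4ed9a` prefix in the log above) that holds `.exitcode` and `.command.err` files. The following is a minimal sketch, assuming you know where the pipeline placed its Nextflow work directory; the function names `find_failed_tasks` and `report` are illustrative and not part of make_lastz_chains.

```python
from pathlib import Path


def find_failed_tasks(work_dir):
    """Return [(task directory, exit code)] for tasks with a non-zero .exitcode."""
    failed = []
    # Nextflow task directories look like <work_dir>/<2 hex chars>/<hash>/
    for exit_file in sorted(Path(work_dir).glob("*/*/.exitcode")):
        text = exit_file.read_text().strip()
        try:
            code = int(text)
        except ValueError:
            continue  # empty or unreadable exit code file, skip it
        if code != 0:
            failed.append((exit_file.parent, code))
    return failed


def report(work_dir, tail=5):
    """Print each failed task with the last lines of its stderr, if present."""
    for task_dir, code in find_failed_tasks(work_dir):
        print(f"{task_dir}: exit {code}")
        err_file = task_dir / ".command.err"
        if err_file.exists():
            for line in err_file.read_text(errors="replace").splitlines()[-tail:]:
                print(f"    {line}")
```

Calling `report("path/to/work")` with the actual work directory (the path here is a placeholder) lists the failed task directories; the `.command.sh` file in each one shows the exact command that was run.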

lnyawen commented 10 months ago

Hi @kirilenkobm,

Thanks for your reply. That's alright. I'm glad to hear from you now.

I also realized that this appears to be related to cluster runtime or compute resource limits. I ran the command above in local mode on a large node of the cluster. I also tried cluster mode (--executor pbs --executor_partition core28). All the other tasks finished very quickly, but one task kept running; after two or three days it was terminated and automatically resubmitted, with a similar error: [23/0e5576] NOTE: Process execute_jobs (2393) terminated with an error exit status (143) -- Execution is retried (1)
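
For reference, exit status 143 follows the common shell convention of 128 plus a signal number, so it corresponds to signal 15 (SIGTERM). Schedulers often send this signal when a job exceeds its walltime or is otherwise killed, which would fit a time-limit explanation, though the log alone does not prove it. A small sketch of the decoding (the helper name `describe_exit_status` is illustrative only):

```python
import signal


def describe_exit_status(status: int) -> str:
    """Explain a shell-style exit status (values above 128 mean 128 + signal)."""
    if status == 0:
        return "success"
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"  # number not known on this platform
        return f"terminated by {name}"
    return f"exited with error code {status}"


print(describe_exit_status(143))  # terminated by SIGTERM
```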

kirilenkobm commented 10 months ago

Got it. Chromosomes containing many repeats may take much longer. You can try reducing the chunk size for the reference or the query, for example to one tenth of the current size. However, a better long-term solution would be to implement an argument that splits selected chromosomes into smaller chunks without touching the rest. @MichaelHiller what do you think?

lnyawen commented 10 months ago

If I try to reduce the chunk size for the reference and query, should I use the parameters --seq1_limit and --seq2_limit? And what is the best size for these two parameters?

One more question: if I split the chromosomes into smaller chunks, will it change the TOGA results later? I mean, could splitting into smaller chunks cause some genes to be classified as Missing if a gene happens to be located where the sequence is split?

MichaelHiller commented 10 months ago

We typically use 175 Mb chunks for the reference and 50 Mb chunks for the query. Overly long lastz jobs are typically the result of unmasked repeats. Sometimes adding WindowMasker masking helps. Reducing the chunk size to, say, 50 Mb (reference) vs. 10 Mb (query) or smaller may also help.

The chunk size only affects the lastz (all-vs-all local alignment) step. The downstream chaining step should give the same results, and TOGA uses these chains, so TOGA results should not be affected by the chunk size.
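
To get a feel for the trade-off, here is a rough sketch of how chunk size changes the number of lastz jobs: smaller chunks mean more jobs, but each one is shorter. It assumes a simplified model in which every sequence is cut into fixed-size pieces and every reference piece is aligned against every query piece. The actual pipeline may bundle small scaffolds or use overlapping chunks, and the sequence lengths below are made-up toy values.

```python
def count_chunks(seq_lengths, chunk_size):
    """Number of pieces when each sequence is cut into chunks of chunk_size bp."""
    return sum((length + chunk_size - 1) // chunk_size for length in seq_lengths)


def estimate_jobs(ref_lengths, query_lengths, ref_chunk, query_chunk):
    """All-vs-all: every reference chunk is paired with every query chunk."""
    return count_chunks(ref_lengths, ref_chunk) * count_chunks(query_lengths, query_chunk)


# Hypothetical toy genomes (lengths in bp), not real assemblies.
ref = [248_000_000, 150_000_000, 60_000_000]
query = [200_000_000, 90_000_000]

print(estimate_jobs(ref, query, 175_000_000, 50_000_000))  # 24
print(estimate_jobs(ref, query, 50_000_000, 10_000_000))   # 290
```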

lnyawen commented 10 months ago

Thank you for your reply.

You pointed out that the problem might be due to unmasked repeats. This is how I masked the genome: I used RepeatMasker to soft-mask it based on the library identified by RepeatModeler.

Then, to mask the genome further, I need to run WindowMasker on top of the soft-masked genome just mentioned, right?

My idea is to first use WindowMasker to further mask the genome and, if that doesn't work, then reduce the chunk size. What do you think? Or should I just reduce the chunk size?

MichaelHiller commented 10 months ago

Right, WindowMasker on top of the RepeatMasker soft-masked genome. You could add WM only for the scaffolds that cause problems. They likely have well-assembled centromeres, which are not correctly masked by RM.

The quickest approach would be to keep the genome as is and reduce the chunk size. Simultaneously, you can run WM and use the additional masking if a smaller chunk size doesn't do it.
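
One way to spot scaffolds that may be under-masked is to compute the fraction of lowercase (soft-masked) bases per sequence. The sketch below assumes the genome is available as a FASTA file (a .2bit file would first need to be converted, for example with UCSC's twoBitToFa). The function name `softmasked_fraction` is illustrative, and a low fraction is only a hint, not proof that a scaffold is the one slowing lastz down.

```python
import io


def softmasked_fraction(fasta_handle):
    """Return {sequence name: fraction of lowercase (soft-masked) bases}."""
    counts = {}  # name -> [lowercase bases, total bases]
    name = None
    for line in fasta_handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            counts[name] = [0, 0]
        elif name is not None:
            counts[name][0] += sum(1 for base in line if base.islower())
            counts[name][1] += len(line)
    return {n: (low / total if total else 0.0) for n, (low, total) in counts.items()}


# Tiny in-memory example; with a real file use: open("genome.fa")
example = io.StringIO(">scaf1\nACGTacgt\nacgtACGT\n>scaf2\nACGTACGTAC\nacGT\n")
for scaffold, fraction in softmasked_fraction(example).items():
    print(f"{scaffold}\t{fraction:.2f}")
```

Scaffolds whose masked fraction is much lower than the rest of the genome are natural candidates for the additional WM pass.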

lnyawen commented 10 months ago

Ok, thank you for your help!

hillerlab/make_lastz_chains, issue #36