hillerlab / make_lastz_chains

Portable solution to generate genome alignment chains using lastz
MIT License
44 stars 8 forks source link

Error in 'lastz' #21

Closed hongbingp closed 1 year ago

hongbingp commented 1 year ago

Hi!

I'm trying to run make_chains.py for betta fish and human, but I had errors in the last several jobs (~90% completed) of step 'lastz'. The error is like this NOTE: Process execute_jobs (3909) terminated with an error exit status (1) -- Execution is retried (3)

As you suggeset in other similar issues, I applied 4-round RepeatModeler and RepeatMasker for the betta fish genome and download the human genome from https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/ (Which is soft-masked). I also set the --seq1_chunk 50000000 --seq2_chunk 10000000 to reduce the chunk size. But the errors still occurs.

Could you help me figure out why? In addition, is there a way to rerun the failed jobs instead of running the lastz from the beginning?

Thanks! Hongbing

MichaelHiller commented 1 year ago

Hi Hongbing,

RepeatModeler + Masker typically works well, but it may not catch all repetitive structures that are present in the highly-complete genome we have these days (centromers, telomers). One idea is to further reduce the chunksize, if your cluster can handle larger job numbers. Another trick that could help is to run WindowMasker and additionally soft-mask what WindowMasker finds. We had cases where WM better masked tandem repeats in centromers or telomers, that resulted in never-finishing lastz jobs that align the telo/centromer chunk between 2 species.

Michael

hongbingp commented 1 year ago

Hi Michael,

Thank you for your suggestions! I further reduced the chunksize and remasked the betta fish genome. However, the errors still occured. Here are two log files.

slurm-6609863.txt slurm-6618473.txt

Thanks, Hongbing

MichaelHiller commented 1 year ago

Hmm, can you pls post the % of the bases in both assemblies that are soft-masked?

You have about 10000 jobs, which is alright. Also, you align hg38 against the fish, which should not be a problem as both species will not share similar transposons.

Is your betta fish on NCBI? Or can you send it to me? I can only offer to run it on our system and see if that works?

hongbingp commented 1 year ago

I used hard-masked genome for both betta fish and human. 19.97% of betta fish genome was masked. I don't know the percentage for human genome becuase I just downloaded the masked version from UCSC.

Here are the linkes: Human genome https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/latest/hg38.fa.masked.gz Betta fish genome https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_900634795.4/

Thank you for your help!

MichaelHiller commented 1 year ago

Hard masking is not even necessary. I'll give it a try

MichaelHiller commented 1 year ago

Another note: In case you plan to use the chains for TOGA (projecting human genes to the fish), this will likely not work well. TOGA is not meant to work over such large distances.

I'll report how the chain building goes

MichaelHiller commented 1 year ago

Worked without problems for me.

85131593/427001245 = 19.9% of the fish was masked. lastz (286 jobs) took that much time: Time in finished jobs: 591281 sec 9854.7 min 164.2 h 6.8 days Average job time: 2067 sec 34.5 min 0.6 h 0.0 days Longest finished job: 4881 sec 81.3 min 1.4 h 0.1 days

I put the chains, repeat lib, 2bit at https://genome.senckenberg.de/download/forHongbing/

MichaelHiller commented 1 year ago

Pls let me know when you downloaded everything and then close this issue. Thx

hongbingp commented 1 year ago

Thank you so much! It's very helpful!

I used the 2bit file you provided to run the make_chains.py but still failed at the lastz step. Could you share the arguments you used for make_chains.py?

I'm also wondering if you used the soft-masked genome for betta fish and human and ran the RepeatModeler and RepeatMasker for the human genome as well? Would you mind sharing the arguments you used for RepeatModeler and RepeatMasker?

Thanks again! Hongbing

MichaelHiller commented 1 year ago

BuildDatabase -name $assembly $assembly.unmasked.fa RepeatModeler -threads 16 -database $assembly

We typically use these parameters for lastz and for splitting the genome. BLASTZ=/beegfs/software/lastz/lastz-1.04.15/lastz BLASTZ_H=2000 BLASTZ_Y=9400 BLASTZ_L=3000 BLASTZ_K=2400 SEQ1_CHUNK=175000000 SEQ2_CHUNK=50000000

Pls check if your hg38 is masked: twoBitToFa hg38.2bit stdout | faSize stdin 3299210039 bases (161611482 N's 3137598557 real 1503152244 upper 1634446313 lower) in 711 sequences in 1 files

1634446313/3299210039 = 0.495405 (about half is lower case)

hongbingp commented 1 year ago

I figured it out! It's the TMPDIR problem, the default tmp directory in my cluster doesn't have enought space. Have to manually specify the TMPDIR in the DEF file.

Thank you for all your help! Hongbing

MichaelHiller commented 1 year ago

Excellent. Glad to hear.