mcfrith / last-genome-alignments

align one genome to another fragmented genome #20

Open AlisaGU opened 3 months ago

AlisaGU commented 3 months ago

Hi, I have two huge genomes (ref: 48G; query: 20G). To obtain an accurate genome alignment, I annotated the transposable elements (TEs) and kept only the non-TE sequences longer than 50 bp as the reference genome. Unexpectedly, the last-train step was too slow, and I had to kill it after it had run for 7 days.

I also tried swapping the ref and query, and things seemed worse: nothing had been output after two days of running.

Could you give me some tips for running this? Can I skip the last-train step and align them directly? Or is there a better way?

Best regards,

mcfrith commented 3 months ago

Please can you show your commands/options for lastdb and last-train, and also the version (e.g. lastdb --version)?

AlisaGU commented 3 months ago

Sure.

Version: lastal 1542

lastdb command: $lastdb -P 20 ${reference_abbre} ${reference_genome}

last-train command: ${last_train} -P 20 --revsym -D1e9 --sample-number=5000 ${reference_abbre} ${query_genome_sequence} >${train_outfile}

The distribution of the ref genome after removing the TEs: [image]

Genomic fragments shorter than 50 bp are filtered out.
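
The thread does not show how the length filter was applied. As a purely illustrative sketch, one way to drop short FASTA records is an awk filter like the one below. The function name `filter_fasta_min_len` and the file names are hypothetical, and this is not necessarily how the fragments here were produced.

```shell
#!/bin/sh
# Hypothetical helper: keep only FASTA records whose sequence has at least
# $1 bases. Reads FASTA on stdin and writes FASTA on stdout.
# Wrapped sequence lines are joined, so each kept record is written as a
# header line followed by a single sequence line.
filter_fasta_min_len() {
    awk -v min="$1" '
        /^>/ {
            # A new header: flush the previous record if it is long enough.
            if (hdr != "" && length(seq) >= min) print hdr "\n" seq
            hdr = $0; seq = ""; next
        }
        { seq = seq $0 }   # accumulate sequence lines
        END { if (hdr != "" && length(seq) >= min) print hdr "\n" seq }
    '
}

# Example usage (placeholder file names):
#   filter_fasta_min_len 50 < non_te_fragments.fa > reference_filtered.fa
```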

mcfrith commented 3 months ago

Thanks! Since last-train only uses a sample of the query, I wouldn't expect it to be so slow. I guess the slowness may be caused by running out of memory.

I basically suggest following the "Aligning human & chimp genomes" recipe here: https://gitlab.com/mcfrith/last/-/blob/main/doc/last-cookbook.rst

For your huge genomes, I would add this lastdb option: --bits=4.

In the recipe, -uRY128 reduces the run time and memory use. But it lowers the sensitivity, which is fine for closely-related genomes, but not distantly-related ones. If your genomes are distantly-related, you could try something like -uRY4 or -uRY8.

I guess it's not necessary to remove TEs (but I don't know for sure).
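
A minimal sketch of how the lastdb suggestions above might be combined. It only prints candidate command lines rather than running them. The names `refdb` and `reference.fa` are placeholders, the thread count is reused from the commands posted earlier, and the exact recipe should be checked against the cookbook for the installed LAST version.

```shell
#!/bin/sh
# Sketch only: prints lastdb command lines, does not execute them.
# "refdb" and "reference.fa" are placeholder names, not from the thread.

# Build a lastdb command with --bits=4 and a chosen -uRY setting.
#   $1: RY value (128 as in the recipe; 4 or 8 for distantly related genomes)
#   $2: database name
#   $3: reference FASTA file
lastdb_cmd() {
    echo "lastdb -P 20 --bits=4 -uRY$1 $2 $3"
}

lastdb_cmd 128 refdb reference.fa   # closely related genomes
lastdb_cmd 8 refdb reference.fa     # distantly related genomes
```

A smaller RY value trades run time and memory for sensitivity, so which one to use depends on how diverged the two genomes are.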

AlisaGU commented 3 months ago

TEs account for about 90% of the ref genome, and I removed them to speed up the genome alignment, so the slowness is unexpected. Would it be faster to use the whole genome without removing the TEs?

mcfrith commented 3 months ago

Sure, I would expect removing 90% TEs to be faster.

AlisaGU commented 3 months ago

However, the last-train step is slower after removing the 90% TEs, and I have no idea how to deal with that.

Can I ignore last-train and run the lastal step directly?

mcfrith commented 3 months ago

Yes, you can ignore last-train and run lastal directly. Then it will use default, non-trained parameters, which might work quite well or badly, depending on your data.

But last-train should be much faster than the alignment step, whether you remove TEs or not... (I wouldn't use -D1e9 --sample-number=5000.)
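
The two options discussed here can be sketched as command lines. This is an illustration that prints the commands instead of running them. The names `refdb`, `query.fa`, `my.train` and `out.maf` are placeholders, and only the options already shown in this thread are used.

```shell
#!/bin/sh
# Sketch only: prints command lines, does not execute them.
# All file and database names below are placeholders.

# Option A: last-train without -D1e9 --sample-number=5000.
#   $1: database name, $2: query FASTA, $3: output train file
last_train_cmd() {
    echo "last-train -P 20 --revsym $1 $2 > $3"
}

# Option B: skip training and run lastal with its default parameters.
#   $1: database name, $2: query FASTA, $3: output alignment file
lastal_default_cmd() {
    echo "lastal -P 20 $1 $2 > $3"
}

last_train_cmd refdb query.fa my.train
lastal_default_cmd refdb query.fa out.maf
```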

AlisaGU commented 3 months ago

OK, let me try last-train without -D1e9 --sample-number=5000.

AlisaGU commented 3 months ago

Regarding "I wouldn't use -D1e9 --sample-number=5000": it's also slow without those options.

I previously produced a train file using the whole reference genome and query genome. Can I use this train file as the lastal input?

mcfrith commented 3 months ago

Yes, that train file sounds fine.
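
As a sketch of what reusing an existing train file could look like, assuming it is passed to lastal with the `-p` option: the command is printed rather than run, and `whole_genome.train`, `refdb`, `query.fa` and `out.maf` are placeholder names.

```shell
#!/bin/sh
# Sketch only: prints a lastal command line, does not execute it.
# All file and database names below are placeholders.

# Build a lastal command that reads its parameters from a train file.
#   $1: train file from an earlier last-train run
#   $2: database name, $3: query FASTA, $4: output alignment file
lastal_trained_cmd() {
    echo "lastal -P 20 -p $1 $2 $3 > $4"
}

lastal_trained_cmd whole_genome.train refdb query.fa out.maf
```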

AlisaGU commented 3 months ago

Thanks!