parameter for distant species and large genome

AlisaGU commented 2 years ago

Hi, I have several genomes whose divergence time is between 110 MYA and 295 MYA. These size are quite differerent from 3G to 27G.

At first, I use

lastdb -P20 -uRY32 $reference_abbre $reference_genome
last-train -P20 --revsym -E0.05 -C2 $reference_abbre $query_whole_genome_sequence > ${train_outfile}

# split the large query genome to several parts and run lastal separately
$lastal -P30 -E0.05 -C2 -p $train_outfile $reference_db $query_sequence|${last_split} -fMAF+ > ${many_to_one_maf}

# cat these many to one maf to a big many to one maf
# many to one to one to one
$last_split -r -m1e-5 $many_to_one_maf | $last_postmask > $one_to_one_maf

but the result is so so small.

I later use the example (xentro vs hg38) posted in your github.

genome1="xenTro.fa"
genome2="homo.fna"
# VARIABLE NAMING for test module TODO:

# PROCESS TODO:
# set -x
$lastdb -P20 -uMAM8 myDB $genome1

$last_train -P20 --revsym -D1e9 --sample-number=5000 myDB $genome2 >my.train

$lastal -P20 -D1e9 -m100 -p my.train myDB $genome2 | $last_split -fMAF+ >many-to-one.maf

$last_split -r many-to-one.maf | $last_postmask >out.maf

It's extremely slow(the code has run 3d+), and it's seems these parameter are not suitable for large genome.

I am confused to select these parameters. Could you give some tips?

All the best,

mcfrith commented 2 years ago

If the genome that you give to lastdb is bigger than 4G, I would use lastdb5.
-RY32 makes it very high-speed and low-sensitivity, -uMAM8 the opposite. Try something in between, e.g. -uRY4.
To make it faster, I would omit -m100.
To make it faster, I would use lastal option -C2.

I hope that helps!

AlisaGU commented 2 years ago

Thanks for your reply, I will try again ヾ(´▽‘)ﾉ

AlisaGU commented 2 years ago

I used the alignment between xentro and human as an example.

$lastdb -P20 -uRY4 $reference_abbre $reference_genome
${last_train} -P20 --revsym -D1e9 --sample-number=5000 $reference_genome_dir/$reference_abbre $query_whole_genome_sequence >${train_outfile}
$lastal -P20 -D1e9 -C2 -p $train Xtro $genome2 | $last_split -fMAF+ >many-to-one.maf

$last_split -r many-to-one.maf | $last_postmask >out.maf

Data I used :

Xenopus tropicalis: GCF_000004195.4_UCB_Xtro_10.0_genomic.fna
hg38: GCF_000001405.39_GRCh38.p13_genomic.fna

After change lastdb parameter to uRY4, code run fastly but result is still small(27M) compared to you posted (66M). Is decreased size normal?

mcfrith commented 2 years ago

xentro (frog) and human genomes are extremely diverged from each other, so I suppose a much smaller result is not too surprising. Maybe you can try an intermediate level of speed-versus-sensitivity. For example, if you don't use any -u option (so no -uRY4 or -uMAM8), it will be more sensitive than -uRY4. Or you could try -uMAM4, which is not quite as slow as -uMAM8.

AlisaGU commented 2 years ago

ok, let me try

AlisaGU commented 2 years ago

I find another situation: the one-to-one maf of whole query genome(38M) is larger than that of splited genome(split and cat, 27M). Is using the whole query genome the best way for pairwise genome alignment?

mcfrith commented 2 years ago

For "split and cat", the cat must be done before last-split -r. Then the results should be identical to using the whole query.

AlisaGU commented 2 years ago

This is my split and cat codes:

# make database for reference
$lastdb -P20 -uRY4 $reference_abbre $reference_genome

# create train file for whole query genome
${last_train} -P20 --revsym -D1e9 --sample-number=5000 $reference_genome_dir/$reference_abbre $query_whole_genome_sequence >${train_outfile}

# split whole query genome into 9 parts and run following code for each one.
$lastal -P20 -D1e9 -C2 -p $train_outfile $reference_db $query_sequence | ${last_split} -fMAF+ >${many_to_one_maf}

# cat 9 many-to-one mafs into a whole many-to-one maf
for i in $(seq 1 9); do
    cat part${i}/${reference_abbre}${query_abbre}.${i}.maf >>$global_many_to_one_maf
done

 # many-to-one to one-to one
$last_split -r -m1e-5 $global_many_to_one_maf| $last_postmask >$one_to_one_maf

It seems no error.

By the way, I tried no -u option and uMAM4. No -u option achieves the best balance beteen sensitive and time. Compared to uMAM8, result of no -u option is 61M and that of uMAM4 is 63M. But the time cost of uMAM4 is almost 4 times.

mcfrith commented 2 years ago

Your split and cat seems correct, and it should give identical results to using the whole query genome. If it doesn't, that's strange... I can only suggest checking you didn't make any mistake.

By the way, you can do the cat without a loop: cat part*/${reference_abbre}${query_abbre}.*.maf >$global_many_to_one_maf

(Could the problem be that there was already a $global_many_to_one_maf file before your cat loop?)

AlisaGU commented 2 years ago

I know what the problem is. It's my mistake. I split genome incompletely and sequences used for alignment just accout for only 75% of the whole genome. After correction, everything goes OK!

cat part*/${reference_abbre}${query_abbre}.*.maf >$global_many_to_one_maf Thanks for this tips.

All problems are solved. Thanks for your timely help and LAST. ｡:.ﾟヽ(｡◕‿◕｡)ﾉﾟ.:｡+ﾟ

AlisaGU commented 2 years ago

Enmmm，although the many-to-one mafs' size of whole genome and split-cat are the same, but the final one-to-one maf is still different with whole genome(38M) larger than split-cat (32M). Is it right?

mcfrith commented 2 years ago

They should be identical...

mcfrith / last-genome-alignments

parameter for distant species and large genome #14