sib-swiss / pftools3

A suite of tools to build and search generalized profiles
GNU General Public License v2.0
Calibrate DNA built profiles #19

Open Ebedthan opened 3 years ago

Ebedthan commented 3 years ago

Hello,

I am trying to build PSSM profiles for DNA sequences. The profile construction ran smoothly. Now I need to calibrate the profile with a database and I really cannot find a way to do that. Can you please show me a way or a database to use?

P.S. The profile was built with partial bacterial DNA sequences from NCBI.

Thanks in advance.

smoretti commented 3 years ago

Hi

To calibrate PSSM profiles built from DNA sequences, we recommend shuffling your partial bacterial DNA sequences with a method such as "windows 20" shuffling, and then using the shuffled database for the calibration.

Ebedthan commented 3 years ago

Okay, thanks @smoretti. I have taken the time to read the article by Pagni and Jongeneel, but it is not clear to me which steps or tools to use to shuffle DNA sequences with a method like "windows 20". Could you please help me with the process of creating such a shuffled database, or point me to relevant resources? I really need it. Thanks in advance.

smoretti commented 3 years ago

A script to do this is provided in the distribution (src/Perl/scramble_fasta.pl). It can run several types of shuffling. For more information, run perl scramble_fasta.pl -h

The "windows 20" method should be run with perl scramble_fasta.pl -m window -P 20 a_file_with_all_your_partial_bacterial_DNA_sequences_in_fasta_format
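To make the idea concrete, here is a minimal Python sketch of window shuffling. It assumes that "windows 20" means permuting the residues inside consecutive, non-overlapping windows of 20 positions, which preserves local composition while destroying the signal. This is an illustration only and not the code of scramble_fasta.pl, whose exact behavior may differ; the function names `window_shuffle` and `read_fasta` are hypothetical.

```python
import random


def window_shuffle(seq, window=20, rng=None):
    """Shuffle residues inside consecutive, non-overlapping windows.

    Assumption: this is what "window" shuffling means here; the Perl
    script shipped with pftools3 may implement it differently.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    rng = rng or random.Random()
    pieces = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])
        rng.shuffle(chunk)  # permute only within this window
        pieces.append("".join(chunk))
    return "".join(pieces)


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


if __name__ == "__main__":
    # Hypothetical toy input, not real data.
    toy = [">example", "ACGTACGTACGTACGTACGTACGT", "TTTTGGGGCCCCAAAA"]
    for name, sequence in read_fasta(toy):
        print(">" + name + "_shuffled")
        print(window_shuffle(sequence, window=20, rng=random.Random(1)))
```

Each window keeps its own base composition, so the shuffled database resembles the original sequences locally without containing real matches.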

Ebedthan commented 3 years ago

Many thanks for your help! I'm trying it now.

Ebedthan commented 3 years ago

Thank you again, @smoretti, for the help and for pointing me to the Perl script. I'll explore the other files in the pftools2 package further.

Ebedthan commented 3 years ago

Hello @smoretti,

I ran perl scramble_fasta.pl -m window -P 20 bacterial_dna.fa > mywindow20.seq and got the database for the profile calibration. However, in the resulting profiles both cut-off values have a score like SCORE=-2147483648. And when running pfscanV3 or pfsearchV3, I get the following error:

Error: Inconsistent alignment found in alignment 3 - no list produced.
       Alignement should be from 1431 to 1!
Thread 0 : Internal error xalip reported no possible alignment for sequence 0(0) (nali=-1)!

This is the first time I have seen a negative SCORE, and I'm trying to find out what I'm doing wrong.

Thanks in advance for the help.

smoretti commented 3 years ago

Negative SCOREs are possible, mainly when global (not local) profiles are used.

Your case is trickier. SCOREs of such large magnitude look like a memory issue: to optimize speed and memory usage, matches in pftools3 are stored in 32 bits in memory. When very large profiles are used, that storage is exceeded.

Could you retry with shorter sequences (and profiles)?
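As a quick sanity check on the number itself: the reported SCORE of -2147483648 equals -2^31, the smallest value a signed 32-bit integer can hold. The Python sketch below only illustrates that arithmetic and how a value one past the 32-bit maximum wraps around; it makes no claim about how pftools3 actually produced the value. The helper names `fits` and `wrap_int32` are hypothetical.

```python
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1

# Value taken from the profile reported in this thread.
REPORTED_SCORE = -2147483648


def fits(value, bits):
    """Return True if value is representable as a signed integer of `bits` bits."""
    low = -(2 ** (bits - 1))
    high = 2 ** (bits - 1) - 1
    return low <= value <= high


def wrap_int32(value):
    """Emulate two's-complement wraparound of a signed 32-bit integer."""
    return (value + 2 ** 31) % 2 ** 32 - 2 ** 31


if __name__ == "__main__":
    print(REPORTED_SCORE == INT32_MIN)      # True
    print(fits(REPORTED_SCORE, 16))         # False
    print(wrap_int32(INT32_MAX + 1))        # -2147483648
```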

Ebedthan commented 3 years ago

I would like to, but I would lose important gene information. I am already using partial gene sequences that are shorter than the full gene. Is there no other way? Or would it be possible to increase the memory storage for DNA profiles?

smoretti commented 3 years ago

Sorry, I missed your message.

In fact, by default profiles should be stored in 16 bits. If you rebuild pftools3 with the option cmake -DUSE_32BIT_INTEGER=ON, profiles will be stored in 32 bits. Maybe that will solve your issue.

If it does not, you can try shorter profiles: split them and build overlapping profiles.
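One possible way to plan such a split is sketched below in Python: it computes overlapping column ranges over an alignment and slices every aligned sequence accordingly, so that each slice can be used to build a shorter profile. The segment size and overlap are hypothetical parameters chosen for illustration, and the function names are not part of pftools3.

```python
def overlapping_segments(length, size, overlap):
    """Return half-open (start, end) ranges of at most `size` columns
    that cover [0, length), with consecutive ranges sharing `overlap` columns."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if length <= 0:
        return []
    step = size - overlap
    segments = []
    start = 0
    while True:
        end = min(start + size, length)
        segments.append((start, end))
        if end >= length:
            break
        start += step
    return segments


def slice_alignment(aligned, size, overlap):
    """Cut every aligned sequence into the same overlapping column ranges.

    `aligned` is a list of equal-length strings (one per aligned sequence).
    Returns one list of sub-sequences per segment.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        raise ValueError("aligned sequences must have equal length")
    return [[row[s:e] for row in aligned]
            for s, e in overlapping_segments(length, size, overlap)]


if __name__ == "__main__":
    # Illustrative numbers only: 1000 columns, 400-column pieces, 100 shared.
    print(overlapping_segments(1000, 400, 100))
    # [(0, 400), (300, 700), (600, 1000)]
```

The overlap is meant to ensure that a match spanning a cut point still falls entirely inside at least one of the shorter profiles, provided the match is no longer than the overlap.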

Ebedthan commented 3 years ago

Thanks for your response. While waiting for it, I started splitting the sequences to build shorter, overlapping profiles, but I have not gotten far yet. I'll definitely try both options and see which one leads to meaningful results. I'll let you know.