jasperlinthorst / reveal

Graph based multi genome aligner
MIT License
46 stars 3 forks source link

Extract the alignement to a file ? #11

Open ghost opened 7 years ago

ghost commented 7 years ago

Hi Jasper,

I'm interested in aligning bacterial genome (~ 5Mbp so ~5.2Mo each ) but in order to compare different tools, I wonder if there is a way to actually extract the alignment in a text/fasta... format from the graph ?

Otherwise, how did you manage to count the number of inversion or the percentage of identity (as you did for your pre-print) ?

I look forward for your answer.

Regards, Sam

jasperlinthorst commented 7 years ago

Hi Sam, The graph encoded in the resulting GFA file essentially describes the alignment, it should be pretty straightforward to parse. I don't have any scripts that output the graph in a row-column based MSA kind of format. Main reason for that is the fact that a graph (with more than two samples) can encode more complex structures which can not properly be encoded this way...

The amount of identity can be calculated by summing the length of all aligned sequence segments (multiplied by the number of genomes for which that segment was observed, look at the ORI tag for each segment) and divide that by the total amount of sequence that was input into the alignment. Note that the parameters -m, -e, -n, -c etc. have a large impact on the number of segments that are aligned, and thus the identity calculation.

Good luck, Jasper

ghost commented 7 years ago

Hi Jasper,

Thank you for your fast answer and your advices. I will try to parse the GPA file as you said.

I will let you know about what I can get.

Regards, Sam

ghost commented 7 years ago

Hi Jasper,

It was indeed pretty trivial to convert the gfa file into xmfa.

On another note, I ran several alignment on the same dataset of 3 collinear genomes to test the different options of the tool and I noticed that the standard output remains the same although counting the MEMs (depending on the option choosen and the default minimum length) would indicate that the final alignment are different, as expected. Note that the standard output does change depending on the dataset. Is it just an error or have I misunderstood ?

Regards, Sam

jasperlinthorst commented 7 years ago

Hi Sam, Which options did you change? There are various options for which a change within a 'reasonable' range might not affect the resulting graph. I should write some more documentation on what these options actually do and which are useful to look in to, but I haven't had time to do so yet. However, if you change let's say the -m option this should affect the resulting graph (and thus the alignment..)...

Cheers, Jasper

fbemm commented 7 years ago

Hi @samaln,

could you share the approach you used to generate a XMFA from a reveal GFA?

Thanks, Felix

ghost commented 7 years ago

Hi @jasperlinthorst and @fbemm, The approach I used so far is quite limited since I just parsed the gfa file and applied the options as constrains for the segment to be write in the xmfa. For instance, I am joining the little perl script I am when applying only the -m minimum MEM length option (to run it : perl gfa_conversion_ang.txt , then enter the value for -m). gfa_conversion_ang.txt

I think it shouldn't be too difficult to add other constrains such as -n, -g ... since you can play with the values given by ORI:Z: . But I have no idea how u can manage to apply constrains from options as -c or -e that @jasperlinthorst mentioned above, which would be undoubtedly very interesting.

Please let me know if you both have any remark or idea about that approch.

@jasperlinthorst, I ran other alignments with the -m options and the standard output actually changes. But not with -g -x -s -c options I always get the following statement showed on my screen for all those different commands. Here is what I get :

1/ -s option : $ reveal align -o UAB_targetsamplesU -s U00096_GR.fsa ../donnees_3coli/U00096_GR.fsa ../donnees_3coli/AE014075_GR.fsa ../donnees_3coli/BA000007_GR.fsa and $ reveal align -o UAB_targetsamplesB -s BA000007_GR.fsa ../donnees_3coli/U00096_GR.fsa ../donnees_3coli/AE014075_GR.fsa ../donnees_3coli/BA000007_GR.fsa

2/ -x option $ reveal align -o UAB_maxsamples1 -x 1 ../donnees_3coli/U00096_GR.fsa /home/saah/montages_reseau../donnees_3coli/AE014075_GR.fsa ../donnees_3coli/BA000007_GR.fsa and $ reveal align -o UAB_maxsamples2 -x 2 ../donnees_3coli/U00096_GR.fsa /home/saah/montages_reseau ../donnees_3coli/AE014075_GR.fsa ../donnees_3coli/BA000007_GR.fsa and $ reveal align -o UAB_maxsamples3 -x 3 ../donnees_3coli/U00096_GR.fsa /home/saah/montages_reseau ../donnees_3coli/AE014075_GR.fsa ../donnees_3coli/BA000007_GR.fsa

3/ -g option $ reveal align -o UAB_minsamples1 -g 1 ../donnees_3coli/U00096_GR.fsa /home/saah/montages_reseau ../donnees_3coli/AE014075_GR.fsa ../donnees_3coli/BA000007_GR.fsa and $ reveal align -o UAB_minsamples2 -g 2 ../donnees_3coli/U00096_GR.fsa /home/saah/montages_reseau ../donnees_3coli/AE014075_GR.fsa ../donnees_3coli/BA000007_GR.fsa and $ reveal align -o UAB_minsamples3 -g 3 ../donnees_3coli/U00096_GR.fsa ../donnees_3coli/AE014075_GR.fsa ../donnees_3coli/BA000007_GR.fsa

4/ -c option $ reveal align -o UAB_minscore5 -c 5 ../donnees_3coli/U00096_GR.fsa /home/saah/montages_reseau ../donnees_3coli/AE014075_GR.fsa ../donnees_3coli/BA000007_GR.fsa and same with -c 0 , -c 2 , -c 10, -c 20

The standard output I get each time : 08/02/2017 06:35:27 PM Loading input... 08/02/2017 06:35:27 PM Reading fasta: ../donnees_3coli/U00096_GR.fsa ... 08/02/2017 06:35:27 PM Reading fasta: ../donnees_3coli/AE014075_GR.fsa ... 08/02/2017 06:35:28 PM Reading fasta: ../donnees_3coli/BA000007_GR.fsa ... 08/02/2017 06:35:28 PM Constructing index... Sorting suffixes... Done. Filling inverse suffix array... Done. Compute LCP... Done. 08/02/2017 06:35:31 PM Done. 08/02/2017 06:35:31 PM Constructing multi-alignment... 08/02/2017 06:37:13 PM Merging nodes... 08/02/2017 06:37:16 PM Done. 08/02/2017 06:37:16 PM U00096_GR.fsa-AE014075_GR.fsa-BA000007_GR.fsa (71.28% identity, 10811364 bases out of 15168129 aligned, 96165 nodes out of 201427 aligned). 08/02/2017 06:37:16 PM Writing graph... 08/02/2017 06:37:21 PM Done. 08/02/2017 06:37:21 PM Alignment graph written to: UAB_minscore20.gfa

As I understand the way reveal works, there should be at least slight differences on the number of bases aligned. Sorry for the bothering but I would like to use those figures since I cannot properly convert the file.

Best Regards, Sam

jasperlinthorst commented 7 years ago

Hi Sam, Sorry for replying to this after so long. These parameters need proper implementations, documentation, testing and some rethinking on my part, but I'll try to write down the general idea behind some of them...

-c: This parameter determines whether a match in an alignment is incorporated into the graph. For instance, if you input two fasta files and there's a match of 1000bp between the two genomes and the indel penalty (given some penalty scheme) for this match adds up to 200. Then I calculate the score for this match as (10003)-(2001)=2800, where the 3 and the 1 come from the --ws and --wp parameters. Now if you set -c to a value higher than 2800 this match will not be incorporated into the graph. By default, this value is not considered, but setting it to a high number should change the output.

-g, -x and -s: These parameters should apply when you align a sequence against a graph, where in the graph, only nodes that satisfy these parameters are considered for alignment (meaning they can be mutated by the alignment). So, if in you case, you are aligning only sequences these parameters are not considered and therefore won't change the output, so that should explain your findings.

I recently added the 'stats' subcommand to calculate some statistics on the alignment graph, so maybe that could be a starting point for your identity calculation. Also, I'd suggest making some visualisations with 'reveal plot' or 'reveal gplot' to get an idea of how much of your genomes are actually aligned in the graphs that you generate and if there are any major structural events going on. In case, these can be addressed with the 'finish --order=chains' subcommand, but this is still quite preliminary for now.

Let me know, Jasper