gifford-lab / GEM3

GEM software suite (GEM, GPS, KMAC, KSM, RMD, etc.)
10 stars 5 forks source link

How to get the coordinates from KSM results #7

Open xubeisi opened 6 years ago

xubeisi commented 6 years ago

After I run java -Xmx10G -jar ~/opt/gem.v3.4/gem.jar KSM --fasta test.pk.fa --g ~/opt/gem.v3.4/hg19.chrom.sizes --ksm test.ksm.lst --out test.pk.gemscan

I got "test.pk.gemscan.motifInstances.txt", however, I failed try to convert the results back to the coordinates by the position stored in fasta file.

For instance:

chr1:6259452-6259796(.) TACAACGTGCTGGGATCGGCAGGGGCTCTGGCCCGCTCCGGCCGACCTGCCAGCCCACCCCAGCAGGACGCTGCAGGGCGCCGTCCCCAGCGAGCCTGGGTAGATGCCGGGCTCGGCGAGGCCCACGTGCCTCCCCTGGAGCCGAGGCCTCACGCGGAGCCATACTAACCACAGGAGCCATGGCGGCAGCGGAGTTAGAAAGGGAGGTGAGCGAACTACGCAGACGCAAAGAGCCCGCAGCGCGCAAGGCACGCAGGGTCCAGGCCGCACTAATCACTTTGCCACGCCCCTCGTCCGCCACCTTTTCTCTTGGTTATGTACGATAGGGGAGCGATTGGTTTTTC

got motifs at the same location for two motifs

Motif SeqID Motif_Name SeqName Match SeqPos Coord Strand Score 0 0 noc_gem_test_2.m0_c1723 chr1:6259452-6259796(.) ATCCCAGCAC:ATCCCAGC,ATCCNAGCA,TCCCAGCA,ATCCCAGNA,ATCCCAGNNC, 331 1:6259812:- - 1.24 1 0 noc_gem_test_2.m1_c1799 chr1:6259452-6259796(.) ATCCCAGCAC:ATCCCA,CAGCAC, 330 1:6259794:- - 1.37

But one SeqPos is 330 while another is 331.

their offset values in the KSM motif file seems the same for ATCCCAGCAC as 2.

so how exactly I could find out how to get the correct coordinates as both should be:

chr1:6259459-6259468

image

I attached the files for your references.

Thank you!

x.zip

yguoma commented 6 years ago

The KSM motif match SeqPos position shows the expected binding position of the k-mer set. Because the k-mers are consistently aligned in a KSM, you can figure out what is the binding position in the k-mer sequences. For example, in the provided Oct4.KSM.txt, the offset of first k-mer ATGCAAA is 1, which means the starting base A of ATGC is at position 1 relative to the binding position (at 0). From the fouth k-mer TATGCANA (offset 0), you can also figure out the binding position is T, one base before ATGC. With this in mind, you can check the motif match position in the query sequence. In addition, for motif instances on the minus strand, the SeqPos is the position on the reverse compliment of the input sequence.

For your example, the first match, GTGCTGGGAT, if you take its reverse compliment, it is ATCCCAGCAC, which is the match sequence is the orientation of the individual component k-mers (ATCCCAGC,ATCCNAGCA,TCCCAGCA,ATCCCAGNA,ATCCCAGNNC) in the KSM. Note that this motif match sequence is longer than the individual k-mers, but you can see that it is assembled by these individual k-mers. This is how short k-mers can match longer motif sequences.

I understand you find discrepancy in SeqPos 330 and 331. I want to point out that these two are from different motifs/KSMs. You need to check the offsets of the matched k-mers. This discrepancy is likely coming from the uncertainty when estimating the offset/expected binding position from the motif positions inside the input sequences. Similar but different motifs may have slight differences. It is analogous to two PWMs of the same TF binding that have different length.

Hope this is clear. Let me know if you have more questions.

xubeisi commented 6 years ago

Thanks Yuchun for the explanation. However, it's still not clear how I can get the exactly position of the motif matched.

I understood most of format, such as the match sequence is the assembly of individual k-mers. But how to calculate position by all the information provided? these two are indeed from different KSMs, I checked the offsets for those matches k-mers.

For SeqPos = 331 and for ATCCCAGCAC:ATCCCAGC,ATCCNAGCA,TCCCAGCA,ATCCCAGNA,ATCCCAGNNC the offset from ksm file: ATCCCAGC/GCTGGGAT 2 ATCCNAGCA/TGCTNGGAT 2 TCCCAGCA/TGCTGGGA 3 ATCCCAGNA/TNCTGGGAT 2 ATCCCAGNNC/GNNCTGGGAT 2

For SeqPos = 330 and ATCCCAGCAC:ATCCCA,CAGCAC the offset from ksm file: ATCCCA/TGGGAT 2 CAGCAC/GTGCTG 6

so how exactly can I get the exactly the same position(start - end) by these information?

Best wishes, Beisi

yguoma commented 6 years ago

For the first motif, the binding site position is the B position (i.e. 2 bases before ATC) in the following: BNATCCCAGC, offset=2, meaning the start of k-mer (A) is at position 2 relative to B (position 0) BNATCCNAGCA BNNTCCCAGCA BNATCCCAGNA BNATCCCAGNNC

For the second motif, similarly, binging site position B is 2 bases before ATC: BNATCCCA, offset=2 BNNNNNCAGCAC, offset=6

These two motif matches are from different motifs (motif ID 0 and ID 1). Therefore, you should not assume that they will have the same position. If you are using two similar PWMs, say TCCCAGCA and ATCCCA to scan this sequence, they match to different positions.

xubeisi commented 6 years ago

Thanks. HMM, I guess it's not possible to get the exact location of "ATCCCAGCAC:" from your output then.

minigel commented 5 years ago

Hello @yguoma,

I apologize for reviving an old thread, but I have a similar question about how to obtain coordinates of the matched k-mer from KSM results.

I am stuck on the following example, which is the third entry in the provided Oct4.KSM.scan.motifInstances.txt:

Motif SeqID Motif_Name SeqName Match SeqPos Coord Strand Score
0 0 Oct4.m0 seq_1 691.0 ES_Oct4 3:121848465-121848525 CATGNNAATG:ATGNNAATG,CATGNNAAT, 44 N.A. + 1.69

The match sequence is CATGNNAATG, while the individual matched k-mers, and their offsets according to the provided Oct4.m0.KSM.txt, are:

k-mer r.c. Offset
ATGNNAATG CATTNNCAT 1
CATGNNAAT ATTNNCATG 0

The sequence of the corresponding input sequence in the provided ES_Oct4_61bp.fasta is:

seq_1 691.0 ES_Oct4 3:121848465-121848525 TGCCAATGGGCCCCATCAACATAACAATGACCCCTAT(CATCAT[G]ACAAT)GATCCCCATCAA

In the above sequence, I have enclosed SeqPos = 44 in [brackets] and the k-mer CATGNNAAT in (parentheses).

If my understanding is correct, since the k-mer CATGNNAAT has Offset = 0, I would expect the start of k-mer CATGNNAAT = expected binding position = SeqPos = 44 . However, this does not seem to be the case, and the start position of CATGNNAAT occurs several bases earlier.

Like @xubeisi, I am also very interested in obtaining the coordinates of the assembled matched sequence (e.g. CATGNNAATG), so any clarification would be greatly appreciated. Thank you very much in advance for your help!