ontology repacking and exporting - relates to #34

dimatr commented 4 years ago

Please review.

To use the code please add to the parameters --do-ttl True
The output .ttl files will go to the {bin_width}-turtle folder besides the main .json output folder, e.g. 1000-turtle
Double check the repacking logic in the PangenomeSchematic.py
Check if the bin/path/etc. ids are used correctly
Do not run it on the heavy sets, the turtle repack+dump may take really long

subwaystation commented 4 years ago

Hi @dimatr the good news first: http://ttl.summerofcode.be/ says you produced valid .ttl syntax! Good job. One minor thing: Please remove the newlines between each record, this will save some space, hopefully. It will still be valid Turtle.

For better debugging, let's use the files given in svg_opg1.zip. It is a small example graph I came up with. I ran

python ~/git/component_segmentation/segmentation.py -j svg1.gfa.og.w1.json -f svg1.gfa.og.fasta -o ./cs_svg1 -t svg1.gfa.og.w1.json.ttl

And got cs_svg1.zip. On first glance everything looks good, except the following:

Each path name in a FALDO region is always path0. Here the corresponding path name should pop up.
Path names in vg:linkPaths are counted up from 1-n. Please replace these with the corresponding names.
vg:Link: I am missing forwardLinkEdge and reverseLinkEdge.
vg:Bin: I am missing forwardBinEdge and reverseBinEdge.

6br commented 4 years ago

Thanks for your implementation! My comments:

linkRank or forwardLinkEdge/reverseLinkEdge should be defined between two links, by sorting all paths with two keys (upstream, downstream).
I think vg:Link does not affect on forwardBinEdge and reverseBinEdge, just connecting bins with arrival and departure.
More precisely, faldo:region consists of two exact positions pointing out a reference c.f. https://github.com/OBF/FALDO.

subwaystation commented 4 years ago

I see

<5/region/28-28> a faldo:Region ;
faldo:begin <5/position/28> ;
faldo:end <5/position/28> .

which does not make sense, because the overall pangenome nucleotide length is 17. See

[heumos@wave svg_opg1]$ odgi stats -S -i svg1.gfa.og -V
length: 17
nodes:  4
edges:  7
paths:  3
cov sets
3   5,5-,
10  5,5-,6,
4   6,

No FALDO regions for paths -5 and 6. This means that all classes relying on that information have the wrong regions, as mentioned above.
An addition to @6br FALDO comment: For single positions, which we will have for w1, an exact position is sufficient -> https://github.com/OBF/FALDO#single-position. We don't need a full FALDO region.

dimatr commented 4 years ago

region begin/end are taken from the bin.nucleotide_ranges, and those can go up to 28 - see the input path "6". Please check if this logic is correct https://github.com/graph-genome/component_segmentation/blob/ontology/matrixcomponent/PangenomeSchematic.py#L156
fixed
fixed

subwaystation commented 4 years ago

I just realized, the testing GFA has two 6 paths in it....... sorry @dimatr . I will fix this and start a fresh testing.

subwaystation commented 4 years ago

This would also explain the weird odgi bin outputs we observed. So no bug, I guess :) Just a dump user.

subwaystation commented 4 years ago

<pg/zoom1/component2/bin3/cell5> a vg:Cell ;
    vg:cellRegion <5/region/3> ;
    vg:inversionPercent 0 ;
    vg:positionPercent 1 .

We shoud have a faldo:exactPosition and not a vg:cellRegion here, right? Else, we could not connect to the exact positions.

The cleaned data: svg_opg1_12052020.zip

I think CS is outputting more links compared to what odgi bin gives us. Short example:

<pg/zoom1/link2> a vg:Link ;
    vg:arrival <pg/zoom1/component3/bin5> ;
    vg:linkPaths <5>,
        <5-> .

But in the odgi bin output using -g, such a link does not exist. CS must make them up?! @6br @dimatr Can you confirm this? By the way, I saw these links also in the usual CS output, so it is not an issue with the Turtle implementation.

subwaystation commented 4 years ago

I decided to open an issue for my concerns about CS https://github.com/graph-genome/component_segmentation/issues/48.

dimatr commented 4 years ago

I have made another update. The Cell - Region - ExactPosition relations are now understandable:

<pg/zoom1/component2/bin3/cell5> a vg:Cell ;
    vg:cellRegion <5/3-3> ;
    vg:inversionPercent 0 ;
    vg:positionPercent 1 .

<5/3-3> a faldo:Region ;
    faldo:begin <5/3> ;
    faldo:end <5/3> .

<5/3> a faldo:ExactPosition ;
    faldo:position 3 .

This way all the object go down to the atomic ones. vg:Cell should contain vg:Region - this is in the vg schema.

I use a short identifier <path1/2-3> instead of a longer version <path1/region/2-3> as in the example. Is it an acceptable approach?

6br commented 4 years ago

For me, <path1/2-3> is natural and acceptable. My concern is if there is a way to describe the path name as objects explicitly.

6br commented 4 years ago

@dimatr Is it possible to embed a path on faldo:ExactPosition?

<5/1> a faldo:ExactPosition ;
    faldo:position 1 ;
    faldo:reference 5 .

The orientation of path can be encoded like this (if the reference is positive strand)

_:1b   a  faldo:ExactPosition, faldo:ForwardStrandPosition ;
            faldo:position 1 ;
            faldo:reference ddbj:XXXDSDS .

Because if the path is inverted is encoded in vg:inversionPercent, I think we can dispense with faldo:ForwardStrandPosition.

6br commented 4 years ago

As @JervenBolleman suggests, it's better to set a path name as an independent subject.

<chr1/exactposition/1>  a  faldo:ExactPosition, faldo:ForwardStrandPosition ;
            faldo:position 1 ;
            faldo:reference <chr1> .
<chr1> a vg:Path .

subwaystation commented 4 years ago

@dimatr here is the current output of odgi bin with the orientation encoded in the ranges: svg1.gfa.og.w1.json.zip Command line I used:

/home/heumos/git/odgi/bin/odgi bin -i svg1.gfa.og -f svg1.gfa.og.fasta -j -w 1 -s > svg1.gfa.og.w1.json

Now it should be possible to encode the orientation of a path for all cases.

Do you have any more questions? Need some feedback? Or comments from @6br ?

6br commented 4 years ago

Feel free to contact me at any time when you have any questions!

subwaystation commented 4 years ago

Hi @dimatr thanks for the hard work! There are some minor things I stumpled upon.

I still see links like

<pg/zoom1/link2> a vg:Link ;
    vg:arrival <pg/zoom1/component3/bin5> ;
    vg:departure <pg/zoom1/component2/bin4> ;
    vg:linkPaths <5>,
        <5-> ;
    vg:linkRank 2 .

<pg/zoom1/link3> a vg:Link ;
    vg:arrival <pg/zoom1/component4/bin12> ;
    vg:departure <pg/zoom1/component3/bin11> ;
    vg:linkPaths <5>,
        <5->,
        <6> ;
    vg:linkRank 3 .

which are not in the given JSON. What is the status of #48?

Regarding the faldo:ExactPosition:

<5-/1> a faldo:BackwardStrandPosition,
        faldo:ExactPosition ;
    faldo:position 1 ;
    faldo:reference <5-> .

<5-/10> a faldo:BackwardStrandPosition,
        faldo:ExactPosition ;
    faldo:position 10 ;
    faldo:reference <5-> .

<5-/5> a faldo:ExactPosition,
        faldo:ForwardStrandPosition ;
    faldo:position 5 ;
    faldo:reference <5-> .

There are several issues here:

https://github.com/OBF/FALDO#known-positions says it should be faldo:ReverseStrandPosition
Please put the faldo:ExactPosition before faldo:ReverseStrandPosition. You have done that for the faldo:ForwardStrandPosition
All faldo:ExactPosition that are labeled as faldo:ForwardStrandPosition should be declared as faldo:BackwardStrandPosition. And vice versa.

Sometimes, I am missing the vg:forwardBinEdge

<pg/zoom1/component3/bin11> a vg:Bin ;
    vg:binRank 11 ;
    vg:cells <pg/zoom1/component3/bin11/cell5>,
        <pg/zoom1/component3/bin11/cell5-> ;
    vg:reverseBinEdge <pg/zoom1/component3/bin10> .

and sometimes the vg:reverseBinEdge

<pg/zoom1/component4/bin12> a vg:Bin ;
    vg:binRank 12 ;
    vg:cells <pg/zoom1/component4/bin12/cell5>,
        <pg/zoom1/component4/bin12/cell5->,
        <pg/zoom1/component4/bin12/cell6> ;
    vg:forwardBinEdge <pg/zoom1/component4/bin13> .

I will now convert the GFA to .ttl using SpOdgi in order to compare the IRI scheme.

dimatr commented 4 years ago

To get rid of the strange links please use the -nal run argument
faldo:ReverseStrandPosition is fixed
the order of the attributes is irrelevant. They are printed out alphabetically.
I have swapped the directions

subwaystation commented 4 years ago

Nice! Yes you are right, I missed that about the alphabetical printing.

I am still puzzled about vg:forwardBinEdge and vg:reverseBinEdge.

subwaystation commented 4 years ago

I now tried with a zooming level of 2. Here, the orientation of the ExactPosition still doesn't work:

JSON excerpt:

{"path_name":"5-","bins":[[1,1,0,0.0384615,[[1,2]]]

Turtle excerpt:

<5-/1> a faldo:ExactPosition,
        faldo:ReverseStrandPosition ;
    faldo:position 1 ;
    faldo:reference <5-> .

It should be ForwardStrandPosition. Commands I used:

/home/heumos/git/odgi/bin/odgi bin -i svg1.gfa.og -f svg1.gfa.og.fasta -j -w 2 -s -g > svg1.gfa.og.w2.g.json
python ~/git/component_segmentation/segmentation.py -j svg1.gfa.og.w2.g.json -f svg1.gfa.og.fasta -t -o dmytro_18062020_w2 -nal

subwaystation commented 4 years ago

inversionPercent and positionPercent should be printed as double:

    vg:inversionPercent 5e-01 ;
    vg:positionPercent 5e-01 .

dimatr commented 4 years ago

[1,2] should represent the ForwardStrandPosition - obviously. Please make sure the "single bin" intervals are odgi encoded in a similar way, e.g. [0,5] - ForwardStrandPosition, [5,0] - ReverseStrandPosition. I am not sure it is the case now

dimatr commented 4 years ago

All the printouts are created by the RDF Python library call, I have no control on the decimal formatting.

6br commented 4 years ago

https://rdflib.readthedocs.io/en/stable/rdf_terms.html#literals This is not ideal but it might be better to explicitly write Literal(self.position_percent, datatype=XSD.double)

dimatr commented 4 years ago

got it, *_percent are doubles now.

subwaystation commented 4 years ago

[1,2] should represent the ForwardStrandPosition - obviously. Please make sure the "single bin" intervals are odgi encoded in a similar way, e.g. [0,5] - ForwardStrandPosition, [5,0] - ReverseStrandPosition. I am not sure it is the case now

It works for a bin width of 1, but for a bin width of e.g. 2 it turns around the orientation again.

subwaystation commented 4 years ago

Now I get the logic. In CS we have ontology.Position(real_end, begin >= end, path, cell_ns). But we have a logic conflict. Here an overview of the current orientation encoding from odgi bin:

Single Position Forward: [N, 0].
Single Position Reverse: [0,N].
Range Forward: [N, N+i].
Range Reverse: [N-i,N]. Where N is a nucleotide position and i an integer so that the resulting number is a valid nucleotide position, too.

So with the current code, we don't filter as we should. Because we will always hit a forward and a reverse case. Currently thinking about changing odgi bin so that the ranges fit the rules in CS.

subwaystation commented 4 years ago

@dimatr @6br This is ready from my point of view. But our IRIs don't match the ones from SpOdgi 100%. One Example: CS:

<6/11> a faldo:ExactPosition,
        faldo:ForwardStrandPosition ;
    faldo:position 11 ;
    faldo:reference <6> .

SpOdgi:

<6-p11> <http://biohackathon.org/resource/faldo#reference> <http://example.org/vg/path/6> .
<6-p15> <http://biohackathon.org/resource/faldo#position> "15"^^<http://www.w3.org/2001/XMLSchema#integer> .
<6-p15> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://biohackathon.org/resource/faldo#ExactPosition> .
<6-p15> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://biohackathon.org/resource/faldo#Position> .
<6-p15> <http://biohackathon.org/resource/faldo#reference> <http://example.org/vg/path/6> .

What I specifically mean is <6/11> vs <6-p11>. Full example files:

seq_chunk0_bin1.ttl.txt svg1.gfa.og.ttl.txt

I will discuss this with @jervenbolleman. When we clarified this, we can merge it into master.

subwaystation commented 4 years ago

Our path IRI needs to be path/5 instead of just 5 so we are in concordance with SpOdgi.

Another note: For some reason, even when the base IRI is set, it is not present for the external Ontologies we fiddle in. This is also the case for SpOdgi. We can fill that gap using apache-jena/bin/riot.

In order to harmonize things, I would like to replace pg with vg so it will be easer to sync with SpOdgi. @6br @dimatr What do you think of this?

6br commented 4 years ago

I agree to have a consistent IRI with spOdgi. Also, I also would like to replace pg with vg so it will be easer to sync with SpOdgi.

dimatr commented 4 years ago

I have just pushed a big update targeting large datasets, please have a look.

subwaystation commented 4 years ago

Nice work @dimatr ! Really cool.

PubSeq

I can get a RDF output for the current PubSeq data of ~1300 genomes in 22 minutes :) But it gives me the following warning:

[06/07/2020 08:48:39 - INFO - __main__ - 97] Starting Segmentation process on 1190 Paths.
[06/07/2020 08:48:39 - INFO - __main__ - 247] Largest bin_id was 129679; Found 7427 dividers.
[06/07/2020 08:48:39 - INFO - __main__ - 252] Input has 33806 listed Links.  Segmentation eliminated 0.0% of them.
[06/07/2020 08:48:39 - INFO - __main__ - 254] Found 33806 unique links
[06/07/2020 08:48:39 - INFO - __main__ - 105] Created dividers
[06/07/2020 08:48:39 - INFO - __main__ - 118] Created 7999 components
[06/07/2020 08:49:08 - INFO - __main__ - 92] Populated Matrix and Occupancy per component per path.
[06/07/2020 08:49:08 - INFO - __main__ - 122] populated matrix
[06/07/2020 08:49:09 - INFO - __main__ - 152] Created 12010 LinkColumns
/home/ubuntu/software/anaconda3/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py:706: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  "timeout or by a memory leak.", UserWarning
Saved file2bin mapping to PubSeqTurtle_02062020/bin2file.json
[06/07/2020 09:09:13 - INFO - __main__ - 413] Finished processing the file relabeledSeqs_dedup_relabeledSeqs_dedup.odgi.w1.g.json

real    22m9.622s
user    72m51.954s
sys     1m37.888s

The resulting n3 is 108GB uncompressed! So no use putting it into a SPARQL endpoint on a machine with 64GB for now.

Pantograph Demo

I also ran it on the current Pantograph demo data which consists of 169 genomes. Here it only took 3 minutes ;) We have 8GB uncompressed and it fits into the RAM database of a Fuseki-Jena-Server which needs 60GB RAM. Of course, I tried our default queries, and the one which should retrieve all the links, including arrival, departure, linkPaths and linkZoomLevel did not succeed. I double-checked and for some links departure exists but not the corresponding arrival. Please check the attachment for the complete list. If you want to take a closer look at these files just ping me. In my example tests, this problem did not occur. Might be due to the nature of the SARS data, but still, we would need these links fully serialized to RDF. Or not at all in the RDF. Whatever the biology says. selectLinksResult.txt

Path Encoding

The encoding of the paths is working, but still not 100% in line with SpOdgi. When listing them, we are doing it correctly, but when referring to them, we are still omitting the path/.

6br commented 4 years ago

PubSeq

It's possible to tune parameters to avoid warnings.

Pantograph Demo

At first glance on selectLinksResult.txt, the same link identifier is shared among more than two bins, which might be the matter. I suspect that something wrong in some boundary condition, but I haven't checked yet.

Path Encoding

I'll vote for adding the prefix path/ in the path identifier. I'll update it if there is no objection.

subwaystation commented 4 years ago

PubSeq

Cool, that would be awesome.

Pantograph Demo

I think the problem is that only departures are synced so far, but not arrivals. @dimatr suggested implementing the same sanity check for arrivals as was done in departures. Maybe he can give you some tips.

Path Encoding.

Yes, I vote for it! I had that, but I added it only in the Path ontology object. Then I realized, we need to alter the code in every ontology object, that emits a path. So be careful, when changing this!

6br commented 4 years ago

PubSeq

[18/07/2020 14:07:07 - INFO - __main__ - 97] Starting Segmentation process on 1190 Paths.
[18/07/2020 14:07:07 - INFO - __main__ - 247] Largest bin_id was 129679; Found 7427 dividers.
[18/07/2020 14:07:07 - INFO - __main__ - 252] Input has 33806 listed Links.  Segmentation eliminated 0.0% of them.
[18/07/2020 14:07:07 - INFO - __main__ - 254] Found 33806 unique links
[18/07/2020 14:07:07 - INFO - __main__ - 105] Created dividers
[18/07/2020 14:07:07 - INFO - __main__ - 118] Created 7999 components
[18/07/2020 14:07:35 - INFO - __main__ - 92] Populated Matrix and Occupancy per component per path.
[18/07/2020 14:07:35 - INFO - __main__ - 122] populated matrix
[18/07/2020 14:07:35 - INFO - __main__ - 152] Created 12010 LinkColumns
Saved file2bin mapping to PubSeqTurtle_02062020/bin2file.json
[18/07/2020 14:31:28 - INFO - __main__ - 413] Finished processing the file relabeledSeqs_dedup_relabeledSeqs_dedup.odgi.w1.g.json

real    26m12.550s
user    19m48.632s
sys     0m22.386s

Now warning messages disappeared.

6br commented 4 years ago

In pubseq data, path name looks like <path/http://collections.lugli.arvadosapi.com/c=13a2b522d373d0f6bfd95a58f821c677+123/sequence.fasta> Is it fine?

Furthermore, cell name looks like <http://example.org/vg/zoom1/component4/bin3/cellhttp://collections.lugli.arvadosapi.com/c=e4c1e7ed3a305e5b49993a2c042c4572+123/sequence.fasta>

6br commented 4 years ago

cell name is now updated to <http://example.org/vg/zoom1/component4/bin3/cell/path<pathname>/http://collections.lugli.arvadosapi.com/c=e4c1e7ed3a305e5b49993a2c042c4572+123/sequence.fasta>

6br commented 4 years ago

I added assertion on building pangenomic sequence, and many downstream looks missing.

[26/07/2020 05:44:52 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 114499
[26/07/2020 05:44:52 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 110940
[26/07/2020 05:44:53 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 101825
[26/07/2020 05:44:54 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 60195
[26/07/2020 05:44:54 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 33805
[26/07/2020 05:44:54 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 98595
[26/07/2020 05:44:55 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 42908
[26/07/2020 05:44:55 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 105785
[26/07/2020 05:44:55 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 121714
[26/07/2020 05:44:56 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 34421
[26/07/2020 05:44:57 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 55038
[26/07/2020 05:44:57 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 80784
[26/07/2020 05:44:57 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 104129
[26/07/2020 05:44:58 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 76952
[26/07/2020 05:44:58 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 101961
[26/07/2020 05:44:58 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 59040
[26/07/2020 05:45:00 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 42975
[26/07/2020 05:45:00 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 30423

6br commented 4 years ago

No-arrivals error is fixed

I added the assertion if there is no arrivals. So we easily know when arrivals are missing. I found this error is due to multi-processing. Instead of using multi-processing, I succeeded to get RDF without any errors.

Saved file2bin mapping to PubSeqTurtle_02062020_2/bin2file.json
[28/07/2020 07:59:10 - INFO - __main__ - 413] Finished processing the file relabeledSeqs_dedup_relabeledSeqs_dedup.odgi.w1.g.json
24310.23user 418.13system 6:27:39elapsed 106%CPU (0avgtext+0avgdata 316126732maxresident)k
264inputs+235223048outputs (0major+118306390minor)pagefaults 0swaps
$ ~/data/PubSeqTurtle_02062020/PubSeqTurtle_02062020_2/1-rdf$ ls -ltrh
total 13G
-rw-rw-r-- 1 ubuntu ubuntu 13G Jul 28 07:22 seq_chunk00000_bin1.nt.gz

josiahseaman commented 4 years ago

This has become a very long branch. Is this PR ready for merging now? I'm not available to review at the moment. @subwaystation are you available for review?

graph-genome / component_segmentation