Open dimatr opened 4 years ago
Hi @dimatr the good news first:
http://ttl.summerofcode.be/ says you produced valid .ttl
syntax! Good job. One minor thing: Please remove the newlines between each record, this will save some space, hopefully. It will still be valid Turtle.
For better debugging, let's use the files given in svg_opg1.zip. It is a small example graph I came up with. I ran
python ~/git/component_segmentation/segmentation.py -j svg1.gfa.og.w1.json -f svg1.gfa.og.fasta -o ./cs_svg1 -t svg1.gfa.og.w1.json.ttl
And got cs_svg1.zip. On first glance everything looks good, except the following:
path0
. Here the corresponding path name should pop up.vg:linkPaths
are counted up from 1-n. Please replace these with the corresponding names.vg:Link
: I am missing forwardLinkEdge
and reverseLinkEdge
.vg:Bin
: I am missing forwardBinEdge
and reverseBinEdge
.Thanks for your implementation! My comments:
linkRank
or forwardLinkEdge/reverseLinkEdge
should be defined between two links, by sorting all paths with two keys (upstream, downstream).vg:Link
does not affect on forwardBinEdge
and reverseBinEdge
, just connecting bins with arrival
and departure
.<5/region/28-28> a faldo:Region ;
faldo:begin <5/position/28> ;
faldo:end <5/position/28> .
which does not make sense, because the overall pangenome nucleotide length is 17. See
[heumos@wave svg_opg1]$ odgi stats -S -i svg1.gfa.og -V
length: 17
nodes: 4
edges: 7
paths: 3
cov sets
3 5,5-,
10 5,5-,6,
4 6,
-5
and 6
. This means that all classes relying on that information have the wrong regions, as mentioned above.w1
, an exact position is sufficient -> https://github.com/OBF/FALDO#single-position. We don't need a full FALDO region.I just realized, the testing GFA has two 6
paths in it....... sorry @dimatr .
I will fix this and start a fresh testing.
This would also explain the weird odgi bin
outputs we observed. So no bug, I guess :) Just a dump user.
<pg/zoom1/component2/bin3/cell5> a vg:Cell ;
vg:cellRegion <5/region/3> ;
vg:inversionPercent 0 ;
vg:positionPercent 1 .
We shoud have a faldo:exactPosition
and not a vg:cellRegion
here, right? Else, we could not connect to the exact positions.
The cleaned data: svg_opg1_12052020.zip
I think CS is outputting more links compared to what odgi bin
gives us. Short example:
<pg/zoom1/link2> a vg:Link ;
vg:arrival <pg/zoom1/component3/bin5> ;
vg:linkPaths <5>,
<5-> .
But in the odgi bin
output using -g
, such a link does not exist. CS must make them up?! @6br @dimatr Can you confirm this?
By the way, I saw these links also in the usual CS output, so it is not an issue with the Turtle implementation.
I decided to open an issue for my concerns about CS https://github.com/graph-genome/component_segmentation/issues/48.
I have made another update. The Cell - Region - ExactPosition
relations are now understandable:
<pg/zoom1/component2/bin3/cell5> a vg:Cell ;
vg:cellRegion <5/3-3> ;
vg:inversionPercent 0 ;
vg:positionPercent 1 .
<5/3-3> a faldo:Region ;
faldo:begin <5/3> ;
faldo:end <5/3> .
<5/3> a faldo:ExactPosition ;
faldo:position 3 .
This way all the object go down to the atomic ones. vg:Cell should contain vg:Region - this is in the vg schema.
I use a short identifier <path1/2-3> instead of a longer version <path1/region/2-3> as in the example. Is it an acceptable approach?
For me, <path1/2-3>
is natural and acceptable. My concern is if there is a way to describe the path name as objects explicitly.
@dimatr Is it possible to embed a path on faldo:ExactPosition
?
<5/1> a faldo:ExactPosition ;
faldo:position 1 ;
faldo:reference 5 .
The orientation of path can be encoded like this (if the reference is positive strand)
_:1b a faldo:ExactPosition, faldo:ForwardStrandPosition ;
faldo:position 1 ;
faldo:reference ddbj:XXXDSDS .
Because if the path is inverted is encoded in vg:inversionPercent
, I think we can dispense with faldo:ForwardStrandPosition
.
As @JervenBolleman suggests, it's better to set a path name as an independent subject.
<chr1/exactposition/1> a faldo:ExactPosition, faldo:ForwardStrandPosition ;
faldo:position 1 ;
faldo:reference <chr1> .
<chr1> a vg:Path .
@dimatr here is the current output of odgi bin
with the orientation encoded in the ranges:
svg1.gfa.og.w1.json.zip
Command line I used:
/home/heumos/git/odgi/bin/odgi bin -i svg1.gfa.og -f svg1.gfa.og.fasta -j -w 1 -s > svg1.gfa.og.w1.json
Now it should be possible to encode the orientation of a path for all cases.
Do you have any more questions? Need some feedback? Or comments from @6br ?
Feel free to contact me at any time when you have any questions!
Hi @dimatr thanks for the hard work! There are some minor things I stumpled upon.
I still see links like
<pg/zoom1/link2> a vg:Link ;
vg:arrival <pg/zoom1/component3/bin5> ;
vg:departure <pg/zoom1/component2/bin4> ;
vg:linkPaths <5>,
<5-> ;
vg:linkRank 2 .
<pg/zoom1/link3> a vg:Link ;
vg:arrival <pg/zoom1/component4/bin12> ;
vg:departure <pg/zoom1/component3/bin11> ;
vg:linkPaths <5>,
<5->,
<6> ;
vg:linkRank 3 .
which are not in the given JSON. What is the status of #48?
Regarding the faldo:ExactPosition
:
<5-/1> a faldo:BackwardStrandPosition,
faldo:ExactPosition ;
faldo:position 1 ;
faldo:reference <5-> .
<5-/10> a faldo:BackwardStrandPosition,
faldo:ExactPosition ;
faldo:position 10 ;
faldo:reference <5-> .
<5-/5> a faldo:ExactPosition,
faldo:ForwardStrandPosition ;
faldo:position 5 ;
faldo:reference <5-> .
There are several issues here:
faldo:ReverseStrandPosition
faldo:ExactPosition
before faldo:ReverseStrandPosition
. You have done that for the faldo:ForwardStrandPosition
faldo:ExactPosition
that are labeled as faldo:ForwardStrandPosition
should be declared as faldo:BackwardStrandPosition
. And vice versa. Sometimes, I am missing the vg:forwardBinEdge
<pg/zoom1/component3/bin11> a vg:Bin ;
vg:binRank 11 ;
vg:cells <pg/zoom1/component3/bin11/cell5>,
<pg/zoom1/component3/bin11/cell5-> ;
vg:reverseBinEdge <pg/zoom1/component3/bin10> .
and sometimes the vg:reverseBinEdge
<pg/zoom1/component4/bin12> a vg:Bin ;
vg:binRank 12 ;
vg:cells <pg/zoom1/component4/bin12/cell5>,
<pg/zoom1/component4/bin12/cell5->,
<pg/zoom1/component4/bin12/cell6> ;
vg:forwardBinEdge <pg/zoom1/component4/bin13> .
I will now convert the GFA to .ttl
using SpOdgi
in order to compare the IRI
scheme.
-nal
run argumentfaldo:ReverseStrandPosition
is fixedNice! Yes you are right, I missed that about the alphabetical printing.
I am still puzzled about vg:forwardBinEdge
and vg:reverseBinEdge
.
I now tried with a zooming level of 2. Here, the orientation of the ExactPosition
still doesn't work:
JSON excerpt:
{"path_name":"5-","bins":[[1,1,0,0.0384615,[[1,2]]]
Turtle excerpt:
<5-/1> a faldo:ExactPosition,
faldo:ReverseStrandPosition ;
faldo:position 1 ;
faldo:reference <5-> .
It should be ForwardStrandPosition
. Commands I used:
/home/heumos/git/odgi/bin/odgi bin -i svg1.gfa.og -f svg1.gfa.og.fasta -j -w 2 -s -g > svg1.gfa.og.w2.g.json
python ~/git/component_segmentation/segmentation.py -j svg1.gfa.og.w2.g.json -f svg1.gfa.og.fasta -t -o dmytro_18062020_w2 -nal
inversionPercent
and positionPercent
should be printed as double:
vg:inversionPercent 5e-01 ;
vg:positionPercent 5e-01 .
[1,2] should represent the ForwardStrandPosition
- obviously. Please make sure the "single bin" intervals are odgi
encoded in a similar way, e.g. [0,5] - ForwardStrandPosition
, [5,0] - ReverseStrandPosition
. I am not sure it is the case now
All the printouts are created by the RDF Python library call, I have no control on the decimal formatting.
https://rdflib.readthedocs.io/en/stable/rdf_terms.html#literals
This is not ideal but it might be better to explicitly write Literal(self.position_percent, datatype=XSD.double)
got it, *_percent
are doubles now.
[1,2] should represent the
ForwardStrandPosition
- obviously. Please make sure the "single bin" intervals areodgi
encoded in a similar way, e.g. [0,5] -ForwardStrandPosition
, [5,0] -ReverseStrandPosition
. I am not sure it is the case now
It works for a bin width of 1, but for a bin width of e.g. 2 it turns around the orientation again.
Now I get the logic. In CS
we have ontology.Position(real_end, begin >= end, path, cell_ns)
.
But we have a logic conflict. Here an overview of the current orientation encoding from odgi bin
:
So with the current code, we don't filter as we should. Because we will always hit a forward and a reverse case. Currently thinking about changing odgi bin
so that the ranges fit the rules in CS
.
@dimatr @6br This is ready from my point of view.
But our IRIs don't match the ones from SpOdgi
100%.
One Example:
CS
:
<6/11> a faldo:ExactPosition,
faldo:ForwardStrandPosition ;
faldo:position 11 ;
faldo:reference <6> .
SpOdgi
:
<6-p11> <http://biohackathon.org/resource/faldo#reference> <http://example.org/vg/path/6> .
<6-p15> <http://biohackathon.org/resource/faldo#position> "15"^^<http://www.w3.org/2001/XMLSchema#integer> .
<6-p15> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://biohackathon.org/resource/faldo#ExactPosition> .
<6-p15> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://biohackathon.org/resource/faldo#Position> .
<6-p15> <http://biohackathon.org/resource/faldo#reference> <http://example.org/vg/path/6> .
What I specifically mean is <6/11>
vs <6-p11>
.
Full example files:
seq_chunk0_bin1.ttl.txt svg1.gfa.og.ttl.txt
I will discuss this with @jervenbolleman. When we clarified this, we can merge it into master.
Our path IRI needs to be path/5
instead of just 5
so we are in concordance with SpOdgi.
Another note: For some reason, even when the base IRI is set, it is not present for the external Ontologies we fiddle in. This is also the case for SpOdgi. We can fill that gap using apache-jena/bin/riot
.
In order to harmonize things, I would like to replace pg
with vg
so it will be easer to sync with SpOdgi. @6br @dimatr What do you think of this?
I agree to have a consistent IRI with spOdgi
. Also, I also would like to replace pg with vg so it will be easer to sync with SpOdgi.
I have just pushed a big update targeting large datasets, please have a look.
Nice work @dimatr ! Really cool.
I can get a RDF output for the current PubSeq data of ~1300 genomes in 22 minutes :) But it gives me the following warning:
[06/07/2020 08:48:39 - INFO - __main__ - 97] Starting Segmentation process on 1190 Paths.
[06/07/2020 08:48:39 - INFO - __main__ - 247] Largest bin_id was 129679; Found 7427 dividers.
[06/07/2020 08:48:39 - INFO - __main__ - 252] Input has 33806 listed Links. Segmentation eliminated 0.0% of them.
[06/07/2020 08:48:39 - INFO - __main__ - 254] Found 33806 unique links
[06/07/2020 08:48:39 - INFO - __main__ - 105] Created dividers
[06/07/2020 08:48:39 - INFO - __main__ - 118] Created 7999 components
[06/07/2020 08:49:08 - INFO - __main__ - 92] Populated Matrix and Occupancy per component per path.
[06/07/2020 08:49:08 - INFO - __main__ - 122] populated matrix
[06/07/2020 08:49:09 - INFO - __main__ - 152] Created 12010 LinkColumns
/home/ubuntu/software/anaconda3/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py:706: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
"timeout or by a memory leak.", UserWarning
Saved file2bin mapping to PubSeqTurtle_02062020/bin2file.json
[06/07/2020 09:09:13 - INFO - __main__ - 413] Finished processing the file relabeledSeqs_dedup_relabeledSeqs_dedup.odgi.w1.g.json
real 22m9.622s
user 72m51.954s
sys 1m37.888s
The resulting n3
is 108GB uncompressed! So no use putting it into a SPARQL endpoint on a machine with 64GB for now.
I also ran it on the current Pantograph demo data which consists of 169 genomes. Here it only took 3 minutes ;) We have 8GB uncompressed and it fits into the RAM database of a Fuseki-Jena-Server which needs 60GB RAM.
Of course, I tried our default queries, and the one which should retrieve all the links, including arrival
, departure
, linkPaths
and linkZoomLevel
did not succeed. I double-checked and for some links departure
exists but not the corresponding arrival
. Please check the attachment for the complete list. If you want to take a closer look at these files just ping me.
In my example tests, this problem did not occur. Might be due to the nature of the SARS data, but still, we would need these links fully serialized to RDF. Or not at all in the RDF. Whatever the biology says.
selectLinksResult.txt
The encoding of the paths is working, but still not 100% in line with SpOdgi. When listing them, we are doing it correctly, but when referring to them, we are still omitting the path/
.
It's possible to tune parameters to avoid warnings.
At first glance on selectLinksResult.txt
, the same link identifier is shared among more than two bins, which might be the matter.
I suspect that something wrong in some boundary condition, but I haven't checked yet.
I'll vote for adding the prefix path/
in the path identifier. I'll update it if there is no objection.
Cool, that would be awesome.
I think the problem is that only departures are synced so far, but not arrivals. @dimatr suggested implementing the same sanity check for arrivals as was done in departures. Maybe he can give you some tips.
Yes, I vote for it! I had that, but I added it only in the Path
ontology object. Then I realized, we need to alter the code in every ontology object, that emits a path. So be careful, when changing this!
[18/07/2020 14:07:07 - INFO - __main__ - 97] Starting Segmentation process on 1190 Paths.
[18/07/2020 14:07:07 - INFO - __main__ - 247] Largest bin_id was 129679; Found 7427 dividers.
[18/07/2020 14:07:07 - INFO - __main__ - 252] Input has 33806 listed Links. Segmentation eliminated 0.0% of them.
[18/07/2020 14:07:07 - INFO - __main__ - 254] Found 33806 unique links
[18/07/2020 14:07:07 - INFO - __main__ - 105] Created dividers
[18/07/2020 14:07:07 - INFO - __main__ - 118] Created 7999 components
[18/07/2020 14:07:35 - INFO - __main__ - 92] Populated Matrix and Occupancy per component per path.
[18/07/2020 14:07:35 - INFO - __main__ - 122] populated matrix
[18/07/2020 14:07:35 - INFO - __main__ - 152] Created 12010 LinkColumns
Saved file2bin mapping to PubSeqTurtle_02062020/bin2file.json
[18/07/2020 14:31:28 - INFO - __main__ - 413] Finished processing the file relabeledSeqs_dedup_relabeledSeqs_dedup.odgi.w1.g.json
real 26m12.550s
user 19m48.632s
sys 0m22.386s
Now warning messages disappeared.
In pubseq data, path name looks like
<path/http://collections.lugli.arvadosapi.com/c=13a2b522d373d0f6bfd95a58f821c677+123/sequence.fasta>
Is it fine?
Furthermore, cell name looks like
<http://example.org/vg/zoom1/component4/bin3/cellhttp://collections.lugli.arvadosapi.com/c=e4c1e7ed3a305e5b49993a2c042c4572+123/sequence.fasta>
cell name is now updated to
<http://example.org/vg/zoom1/component4/bin3/cell/path<pathname>/http://collections.lugli.arvadosapi.com/c=e4c1e7ed3a305e5b49993a2c042c4572+123/sequence.fasta>
I added assertion on building pangenomic sequence, and many downstream looks missing.
[26/07/2020 05:44:52 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 114499
[26/07/2020 05:44:52 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 110940
[26/07/2020 05:44:53 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 101825
[26/07/2020 05:44:54 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 60195
[26/07/2020 05:44:54 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 33805
[26/07/2020 05:44:54 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 98595
[26/07/2020 05:44:55 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 42908
[26/07/2020 05:44:55 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 105785
[26/07/2020 05:44:55 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 121714
[26/07/2020 05:44:56 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 34421
[26/07/2020 05:44:57 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 55038
[26/07/2020 05:44:57 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 80784
[26/07/2020 05:44:57 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 104129
[26/07/2020 05:44:58 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 76952
[26/07/2020 05:44:58 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 101961
[26/07/2020 05:44:58 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 59040
[26/07/2020 05:45:00 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 42975
[26/07/2020 05:45:00 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 30423
I added the assertion if there is no arrivals. So we easily know when arrivals are missing. I found this error is due to multi-processing. Instead of using multi-processing, I succeeded to get RDF without any errors.
Saved file2bin mapping to PubSeqTurtle_02062020_2/bin2file.json
[28/07/2020 07:59:10 - INFO - __main__ - 413] Finished processing the file relabeledSeqs_dedup_relabeledSeqs_dedup.odgi.w1.g.json
24310.23user 418.13system 6:27:39elapsed 106%CPU (0avgtext+0avgdata 316126732maxresident)k
264inputs+235223048outputs (0major+118306390minor)pagefaults 0swaps
$ ~/data/PubSeqTurtle_02062020/PubSeqTurtle_02062020_2/1-rdf$ ls -ltrh
total 13G
-rw-rw-r-- 1 ubuntu ubuntu 13G Jul 28 07:22 seq_chunk00000_bin1.nt.gz
This has become a very long branch. Is this PR ready for merging now? I'm not available to review at the moment. @subwaystation are you available for review?
Please review.
--do-ttl True
{bin_width}-turtle
folder besides the main .json output folder, e.g.1000-turtle