pchampin / sophia_rs

Sophia: a Rust toolkit for RDF and Linked Data

XML Parser fails on large file #77

Closed phillord closed 7 months ago

phillord commented 4 years ago

I have been trying out the XML parser on a large file. Even after an extended period, it fails to finish parsing, whereas the Turtle parser succeeds.

As my large file I have been using the Gene Ontology available at:

http://purl.obolibrary.org/obo/go.owl

(I had to generate the ttl version from this using the OWL API; I can put it somewhere if that would help.)

The ttl version runs in 7 seconds. For the XML version, I do not know whether it is stalling or just slow, because it has not yet run to completion.

fn main() -> Result<(),Error> {
    let input = "/home/phillord/scratch/go.ttl";
    //let input = "/home/phillord/scratch/go.owl";

    let file = File::open(input)?;
    let bufreader = BufReader::new(file);
    let triple_source = sophia::parser::turtle::parse_bufread(bufreader);
    //let triple_source = sophia::parser::xml::parse_bufread(bufreader);
    println!("collecting");
    let start = Instant::now();
    let graph: LightGraph = triple_source.collect_triples().unwrap();
    println!("{}: {:?}", graph.len(), start.elapsed());

    Ok(())
}

      Finished release [optimized] target(s) in 3.77s
     Running `target/release/horned-temp`
collecting
1431737: 7.743499503s
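One way to tell "stalling" from "just slow" is to watch how many bytes the parser has pulled from the file. The following is a minimal sketch using only the standard library; `CountingReader` and `count_lines` are hypothetical names introduced here for illustration and are not part of sophia. The line counter merely stands in for a parser.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Hypothetical helper (not part of sophia): counts the bytes read through it
/// and logs a line to stderr each time another `report_every` bytes have passed.
struct CountingReader<R> {
    inner: R,
    bytes_read: u64,
    report_every: u64,
    next_report: u64,
}

impl<R: Read> CountingReader<R> {
    fn new(inner: R, report_every: u64) -> Self {
        // Guard against a zero interval, which would make the reporting loop spin forever.
        let report_every = report_every.max(1);
        CountingReader { inner, bytes_read: 0, report_every, next_report: report_every }
    }
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.bytes_read += n as u64;
        // Emit one progress line per interval crossed by this read.
        while self.bytes_read >= self.next_report {
            eprintln!("progress: passed {} bytes", self.next_report);
            self.next_report += self.report_every;
        }
        Ok(n)
    }
}

/// Stand-in consumer for the demo: counts lines instead of parsing RDF.
/// Returns (number of lines, total bytes pulled from the underlying reader).
fn count_lines<R: Read>(input: R, report_every: u64) -> io::Result<(usize, u64)> {
    let mut reader = BufReader::new(CountingReader::new(input, report_every));
    let mut lines = 0;
    let mut line = String::new();
    while reader.read_line(&mut line)? > 0 {
        lines += 1;
        line.clear();
    }
    Ok((lines, reader.get_ref().bytes_read))
}
```

In the program above, the same wrapper could presumably sit between the `File` and the `BufReader` that is handed to `parse_bufread`. If the progress lines keep appearing, the parser is slow; if they stop well before the file size is reached, it is more likely stuck.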

pchampin commented 4 years ago

Which version of sophia are you using?

pchampin commented 4 years ago

Assuming you are using the latest release (0.5.3): I just pushed an experimental branch, rio_xml. You might want to try it by replacing xml::RdfXmlParser with xml2::RdfXmlParser in your code, to see whether that solves this issue, and possibly #76 as well.

If it does, I will probably switch to this implementation as the default RDF/XML parser.

phillord commented 4 years ago

I'm trying to work my way through this. It seems to work and parses much more quickly, but it's not a drop-in replacement in my code.

Currently my main use for this just dumps graphs out into [Term; 3]. So I do this:

    let triple_iter = sophia::parser::xml::parse_bufread(bufread);

    let triple_result: Result<Vec<_>, _> = triple_iter.collect();
    let triple_v: Vec<[SpTerm; 3]> = triple_result.unwrap();

But xml2 does not work as a drop-in replacement here, and I haven't managed to work out how to get triples from xml2::RdfXmlParser. Apologies, I find the API rather confusing! I'd be grateful for any hints.
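For readers unfamiliar with the idiom in the snippet above: collecting an iterator of `Result` items into a `Result<Vec<_>, _>` yields all the values on success and stops at the first error. Here is a minimal, std-only sketch of that pattern. The `Triple` alias and the whitespace-splitting `parse_line` are hypothetical stand-ins, not sophia's types or parser.

```rust
/// Stand-in for a parsed triple; the thread uses [SpTerm; 3] instead.
type Triple = [String; 3];

/// Hypothetical toy "parser": one triple per line, three whitespace-separated terms.
fn parse_line(line: &str) -> Result<Triple, String> {
    let parts: Vec<&str> = line.split_whitespace().collect();
    match parts.as_slice() {
        [s, p, o] => Ok([s.to_string(), p.to_string(), o.to_string()]),
        _ => Err(format!("expected 3 terms, got {}: {:?}", parts.len(), line)),
    }
}

/// Collects every triple, or returns the first error encountered.
fn collect_all(input: &str) -> Result<Vec<Triple>, String> {
    input
        .lines()
        .filter(|l| !l.trim().is_empty())
        .map(parse_line)
        // `collect` into `Result<Vec<_>, _>` short-circuits on the first `Err`.
        .collect()
}
```

The design trade-off is that a single malformed item discards everything collected so far, which suits a "dump the whole graph or fail" use like the one described here.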

pchampin commented 4 years ago

> It seems to work and parse much quicker,

good

> but it's not a drop-in replacement in my code.

not quite, you are right...

> Apologies, I find the API rather confusing! I'd be grateful for any hints.

I should be the one to apologize... I'm sorry you feel that way about the API, and I am open to any suggestion to make it easier.

Now about your problem:

TL;DR

This should work for you:

    let triple_source = sophia::parser::xml2::parse_bufread(bufread);
    let triple_result: Result<Vec<[BoxTerm;3]>, _> = triple_source.collect_triples();
    let triple_v = triple_result.unwrap();

Explanations

I hope this helps.

pchampin commented 4 years ago

FTR, there was an error in my previous comment; Vec<BoxTerm> should have been Vec<[BoxTerm;3]>. I just edited it to fix that.

phillord commented 4 years ago

I have it working now. It's taking me a while to test, because I think my code was dependent on behaviour from the old parser that was actually buggy.

phillord commented 4 years ago

Well, it seems to be working well. The two failures I was getting in my test suite were, I am sure, because of behaviour that was buggy in the old parser. It also fixes #76.

In terms of the API, I think the issue is partly mine. I still do not find Rust entirely natural to use. Especially when functionality is implemented through traits, the documentation you need in the Rust docs can be several clicks away or deep in the page. The main thing that would help would be a bit more module documentation and especially examples!

I need to think more on sophia, because at the moment my own https://github.com/phillord/horned-owl duplicates some of the functionality. Too many options.

pchampin commented 4 years ago

> Well, it seems to be working well.

Great. I'll make the Rio parser the default in the next release. I'll close both issues then.

> Main thing that would help would be a bit more module documentation and especially examples!

Yep, that's a long-standing item on my TODO list ;)

phillord commented 4 years ago

More documentation is on everyone's TODO list:-)

Do you have an ETA for a new release?

pchampin commented 4 years ago

> Do you have an ETA for a new release?

I'm hoping to do it by the end of June or beginning of July.

phillord commented 4 years ago

Okay, thanks for letting me know!

pchampin commented 4 years ago

> I'm hoping to do it by the end of June or beginning of July.

A little later than announced, but v0.6.0 is now out, with parser::xml now based on the Rio parser. Give it a try, and feel free to close this issue (and #76) if your problems are solved.

pchampin commented 3 years ago

@phillord are you OK with closing this issue? Since the pre-release patch "[seemed] to be working well", I am assuming that your problem is also solved with the current release.

pchampin commented 3 years ago

@phillord up?

> are you ok to close this issue? Since the pre-release patch "[seemed] to be working well", I am assuming that your problem is also solved with the current release.

pchampin commented 7 months ago

Closing: this issue is very old, and the current RDF/XML parser processes http://purl.obolibrary.org/obo/go.owl without any problem.