owlcs / owlapi

OWL API main repository

Use format detection heuristics to reduce number of failing parse attempts. #554

Open sesuncedu opened 8 years ago

sesuncedu commented 8 years ago

Splitting off from #550.

There are a number of heuristics that can be used to select and order parsers instead of trying every parser until one of them succeeds.

There are three sources of information that are available:

  1. MIME type (for items retrieved via HTTP).
  2. File extension (for files; can also be used on URLs if no MIME type information is available). Not helpful if the extension is .owl, since that extension is used for several syntaxes.
  3. File content. Many formats can be eliminated by reviewing a relatively small amount of content. Some formats are subsets of other formats, and it may not be possible to determine whether a file stays within the subset until the entire file has been processed (it's N-Triples until, surprise, it's TriG).
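To make the idea concrete, here is a minimal sketch of how the three hints could be combined into an ordered candidate list. Everything in it is an assumption for illustration: `FormatGuesser`, its method names, the plain-string format labels, the extension table and the content prefixes are hypothetical and are not part of the OWL API. The content checks are deliberately crude and only look at the first characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: orders format names so the likeliest parser is tried first. */
final class FormatGuesser {

    // Fallback order used when no hint applies (illustrative, not the OWL API order).
    static final List<String> DEFAULT_ORDER = Arrays.asList(
        "RDF/XML", "OWL/XML", "Functional", "Manchester", "OBO",
        "N-Quads", "TriG", "Turtle", "N-Triples");

    /** Returns every known format, with hinted formats moved to the front. */
    static List<String> rank(String mimeType, String name, String head) {
        Set<String> ranked = new LinkedHashSet<>();
        addIfKnown(ranked, fromMimeType(mimeType));   // 1. MIME type
        addIfKnown(ranked, fromExtension(name));      // 2. file extension
        ranked.addAll(fromContent(head));             // 3. leading content
        ranked.addAll(DEFAULT_ORDER);                 // everything else, unchanged order
        return new ArrayList<>(ranked);
    }

    private static void addIfKnown(Set<String> target, String format) {
        if (format != null) {
            target.add(format);
        }
    }

    static String fromMimeType(String mimeType) {
        if (mimeType == null) {
            return null;
        }
        // Drop parameters such as "; charset=utf-8".
        String base = mimeType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        switch (base) {
            case "application/rdf+xml": return "RDF/XML";
            case "application/owl+xml": return "OWL/XML";
            case "text/turtle": return "Turtle";
            case "application/n-triples": return "N-Triples";
            case "application/n-quads": return "N-Quads";
            case "application/trig": return "TriG";
            default: return null;
        }
    }

    static String fromExtension(String name) {
        if (name == null || name.lastIndexOf('.') < 0) {
            return null;
        }
        String ext = name.substring(name.lastIndexOf('.') + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "rdf": return "RDF/XML";
            case "owx": return "OWL/XML";
            case "ofn": return "Functional";
            case "omn": return "Manchester";
            case "obo": return "OBO";
            case "ttl": return "Turtle";
            case "nt": return "N-Triples";
            case "nq": return "N-Quads";
            case "trig": return "TriG";
            default: return null; // includes "owl", which says nothing about the syntax
        }
    }

    static List<String> fromContent(String head) {
        List<String> hits = new ArrayList<>();
        if (head == null) {
            return hits;
        }
        String h = head.trim();
        if (h.startsWith("<?xml") || h.startsWith("<rdf:RDF") || h.startsWith("<Ontology")) {
            if (h.contains("<rdf:RDF")) hits.add("RDF/XML");
            if (h.contains("<Ontology")) hits.add("OWL/XML");
            if (hits.isEmpty()) hits.addAll(Arrays.asList("RDF/XML", "OWL/XML"));
        } else if (h.startsWith("Prefix(") || h.startsWith("Ontology(")) {
            hits.add("Functional");
        } else if (h.startsWith("Prefix:") || h.startsWith("Ontology:")) {
            hits.add("Manchester");
        } else if (h.startsWith("format-version:")) {
            hits.add("OBO");
        } else if (h.startsWith("@prefix") || h.startsWith("@base")) {
            hits.addAll(Arrays.asList("Turtle", "TriG"));
        }
        return hits;
    }
}
```

A caller would then try parsers in the returned order, for example `FormatGuesser.rank(null, "onto.owl", firstBytes)`. Because nothing is removed from the list, a wrong guess only costs extra attempts and never rules out the correct parser.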

Input streams used for content analysis might need to use some subclass of BufferedInputStream, modified to fail attempts to read past marklimit (instead of only throwing the exception in reset()). This prevents a misbehaving content analyzer from messing up later analyzers.
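One possible shape for such a stream is sketched below. It is only an illustration under stated assumptions: `StrictMarkInputStream` and its `unmark()` method are hypothetical names, and the class keeps its own byte counter rather than relying on the internal fields of `BufferedInputStream`. Once the limit is reached, any further read throws, even if the underlying stream happens to be at end of file.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stream that refuses to read past the limit given to mark(). */
final class StrictMarkInputStream extends BufferedInputStream {

    private int limit = -1;  // -1 means no strict mark is active
    private int consumed;    // bytes read or skipped since the last mark()/reset()

    StrictMarkInputStream(InputStream in) {
        super(in);
    }

    @Override
    public synchronized void mark(int readlimit) {
        super.mark(readlimit);
        limit = readlimit;
        consumed = 0;
    }

    @Override
    public synchronized void reset() throws IOException {
        super.reset();
        consumed = 0; // the next analyzer gets the full budget again
    }

    /** Lifts the limit so the chosen parser can read the whole stream. */
    synchronized void unmark() {
        limit = -1;
    }

    // Returns how many of the requested bytes may still be consumed, or fails.
    private int allowed(long requested) throws IOException {
        if (limit < 0) {
            return (int) Math.min(requested, Integer.MAX_VALUE);
        }
        int remaining = limit - consumed;
        if (remaining <= 0) {
            throw new IOException("Attempt to read past mark limit of " + limit);
        }
        return (int) Math.min(requested, remaining);
    }

    @Override
    public synchronized int read() throws IOException {
        allowed(1);
        int b = super.read();
        if (b >= 0 && limit >= 0) {
            consumed++;
        }
        return b;
    }

    @Override
    public synchronized int read(byte[] b, int off, int len) throws IOException {
        if (len <= 0) {
            return super.read(b, off, len);
        }
        int n = super.read(b, off, allowed(len));
        if (n > 0 && limit >= 0) {
            consumed += n;
        }
        return n;
    }

    @Override
    public synchronized long skip(long n) throws IOException {
        if (n <= 0) {
            return 0;
        }
        long skipped = super.skip(allowed(n));
        if (limit >= 0) {
            consumed += (int) skipped;
        }
        return skipped;
    }
}
```

The intended usage would be: call `mark(limit)`, hand the stream to an analyzer, call `reset()`, and repeat for the next analyzer. After the last analyzer, `reset()` followed by `unmark()` hands the untouched stream to the selected parser. An analyzer that tries to over-read gets an IOException immediately, so the mark stays valid for the analyzers that follow.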

ansell commented 8 years ago

On the TriG/etc. issue, you should be able to parse the subsets (Turtle/N-Triples) directly using the TriG parser, although I can't recall the exact performance overhead for that compared to the dedicated parsers. Putting N-Quads in front of TriG (and pushing Turtle/N-Triples to just before TriX) could be useful even without active heuristics: N-Quads isn't a subset of TriG, but in all but the N-Triples case it will fizzle out after a few lines, and it won't bork on N-Triples at any stage.