google / magika

Detect file content types with deep learning
https://google.github.io/magika/
Apache License 2.0

Incorrect JSON/NDJSON detection #57

Open hunter-gatherer8 opened 7 months ago

hunter-gatherer8 commented 7 months ago

These are pretty minor, but:

  1. A simple JSON example is recognized as "Generic text document (text)": no_whitespace.json. If you add whitespace after the ":", it is recognized as "JSON document (code)".
  2. The same example with multiple newline-delimited JSON objects is recognized as JSON, which is understandable but also incorrect, since an NDJSON document is not valid JSON: ndjson.txt

Magika version: 0.5.0 Default model: standard_v1
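
To make the two cases concrete, here is a minimal sketch using only Python's standard `json` module. The two strings are hypothetical stand-ins for the attachments (their actual contents are not reproduced here), and `is_valid_json` is an illustrative helper, not part of Magika.

```python
import json

# Hypothetical stand-ins for the attached files (assumed contents).
no_whitespace = '{"name":"example","value":1}'
ndjson = '{"id":1}\n{"id":2}\n'


def is_valid_json(text: str) -> bool:
    """Return True if the whole text parses as a single JSON value."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Valid JSON even though there is no whitespace after ":".
print(is_valid_json(no_whitespace))
# Not valid JSON: a second object follows the first one.
print(is_valid_json(ndjson))
```

A strict parser accepts the first string and rejects the second, which is the opposite of what the detection results above suggest.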

reyammer commented 7 months ago

Thank you for the report. I have to admit I had never heard of NDJSON before. It seems very similar to JSONL (i.e., one JSON object per line)?

hunter-gatherer8 commented 7 months ago

@reyammer yes, it's the same thing under a different name. NDJSON stands for "Newline-Delimited JSON", and apparently there are two separate community-driven specifications with very minor differences (completely irrelevant for Magika, IMO). But the format itself obviously long predates both specifications, and is just that: one valid JSON object per line, possibly with some empty lines.
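
As a rough illustration of that description, the check below treats a text as NDJSON/JSONL when every non-empty line parses as a JSON object. `is_json_lines` is a hypothetical helper written for this thread, not anything Magika does internally, and restricting lines to objects follows the wording above rather than either specification.

```python
import json


def is_json_lines(text: str) -> bool:
    """True if every non-empty line is a JSON object and at least one exists."""
    seen_object = False
    # Split on "\n" only: str.splitlines() also splits on characters such as
    # U+2028, which may legally appear unescaped inside a JSON string.
    for line in text.split("\n"):
        if not line.strip():
            continue  # empty lines are tolerated
        try:
            value = json.loads(line)
        except json.JSONDecodeError:
            return False
        if not isinstance(value, dict):
            return False  # assumption: only objects count, per the description
        seen_object = True
    return seen_object


print(is_json_lines('{"id":1}\n\n{"id":2}\n'))
print(is_json_lines('{\n  "id": 1\n}'))
```

Note that a single-line JSON object satisfies both this check and a plain JSON parse, so the two formats only become distinguishable once there is more than one record or the JSON is pretty-printed across lines.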

Both communities are aware of each other:

Anecdotally, I've only seen "ndjson" in MIME types, but it appears that "jsonl" is actually the more popular name nowadays, and "ndjson" has been pretty much abandoned. So, yeah, you'd probably be better off with "jsonl" as the name.