iscc / iscc-specs

ISCC: International Standard Content Code
http://iscc.codes

Change wording for text extraction scope. #53

Open titusz opened 5 years ago

titusz commented 5 years ago

Currently: "While text-extraction is out of scope for this specification ..."

Proposed Change: "While detailed procedures for text-extraction from various document formats are out of scope for this specification ..."

For reproducible Content-ID-Text components, the definition of the extraction tool/version is part of the normative specification. It might be updated in a future version of the ISCC, ideally only after compatibility tests. Because the text normalization is comprehensive (especially with the upcoming ISCC v1.1), the impact of different text-extraction tools or versions should be minimal. Even if two implementations of the ISCC generate slightly different Content-IDs, this is not regarded as a failure to produce a valid ISCC code: the similarity-preserving nature of the component would still yield a match or near-duplicate match when the ISCC codes are compared.
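To illustrate the argument above, here is a minimal Python sketch under stated assumptions. It is not the normative ISCC procedure. `normalize_text`, `hamming_distance`, `is_near_duplicate`, the 8-bit threshold, and the sample hash values are all hypothetical and chosen only for illustration. It shows how normalization can absorb whitespace and case differences between extractors, and how two slightly different similarity hashes can still be treated as a near-duplicate match by counting differing bits.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Illustrative normalization only, NOT the normative ISCC procedure."""
    # NFKC folds compatibility characters (e.g. no-break space, ligatures).
    text = unicodedata.normalize("NFKC", text)
    # Lowercase and collapse every run of whitespace to a single space.
    return " ".join(text.lower().split())


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two equally sized bit strings."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: int, b: int, max_bits: int = 8) -> bool:
    """Hypothetical threshold; pick one that suits the application."""
    return hamming_distance(a, b) <= max_bits


# Two hypothetical extractor outputs for the same document.
tool_a = "Text\u00a0Extraction  is\nout of scope"
tool_b = "text extraction is out of scope "
assert normalize_text(tool_a) == normalize_text(tool_b)

# Two made-up 64-bit similarity hashes that differ in three bits.
code_a = 0xF0F0F0F0F0F0F0F0
code_b = code_a ^ 0b10000101
print(hamming_distance(code_a, code_b))   # 3
print(is_near_duplicate(code_a, code_b))  # True
```

If normalization removes most extractor differences, the remaining differences would only flip a few bits of a similarity-preserving hash, so a distance check like this would still report a match.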

lrosenthol commented 4 years ago

"the definition of the extraction tool/version is part of the normative specification"

Mandating a specific tool only works if you also tie it to a specific version of that tool (as you may be implying). But since software is known to have vulnerabilities that require systems to update, this approach is unreasonable and unacceptable.

Additionally, it would prevent innovation in this area, especially for complex formats such as PDF.