Extract fields from unzipped XML?

rmzelle / ref-extractor

Reference Extractor - Extract Zotero/Mendeley references from Microsoft Word files

https://rintze.zelle.me/ref-extractor/

MIT License


Extract fields from unzipped XML? #14

Status: Closed (fbennett closed this issue 7 years ago)

fbennett commented 7 years ago

It seems like you could extract field codes directly from the exploded XML source of the document. That would save the mammoth.js dependency, and should work (with a bit of tweaking maybe?) for both *.docx and *.odt source files.

rmzelle commented 7 years ago

@fbennett, yes, I didn't bother once I got mammoth.js to work, but it would be a lot cleaner to just unzip the files and loop over the relevant XML elements. (would be happy to take a PR 😄)

Looks like http://stuk.github.io/jszip/ or https://gildas-lormeau.github.io/zip.js/index.html might be decent libraries for reading the files (*.odt is also just a zip file? See https://gildas-lormeau.github.io/zip.js/core-api.html#zip-reading-example in particular). And just some XPath for the XML (https://developer.mozilla.org/en-US/docs/Introduction_to_using_XPath_in_JavaScript)?
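For illustration, here is a minimal sketch of the unzip-and-scan idea. It uses Python's standard library rather than the JavaScript libraries linked above, and it is not the project's implementation. It assumes the citations are stored as complex fields in word/document.xml (a `w:fldChar` "begin" marker, one or more `w:instrText` runs, then a "separate" or "end" marker). The function name `extract_field_codes` is hypothetical. Nested fields, `w:fldSimple` elements, footnote/endnote parts, and bookmark-based storage are all ignored here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_field_codes(docx_bytes):
    """Return the instruction text of each complex field in the main document part.

    Hypothetical helper: assumes non-nested complex fields only.
    """
    # A .docx file is a zip archive; the body lives in word/document.xml.
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    codes = []
    current = None  # pieces of the field code being collected, or None outside a field
    for el in root.iter():  # document order
        if el.tag == f"{{{W_NS}}}fldChar":
            kind = el.get(f"{{{W_NS}}}fldCharType")
            if kind == "begin":
                current = []
            elif kind in ("separate", "end") and current is not None:
                # The instruction part ends at "separate" (or at "end" if there is
                # no visible result); a field code may be split across several runs.
                codes.append("".join(current).strip())
                current = None
        elif el.tag == f"{{{W_NS}}}instrText" and current is not None:
            current.append(el.text or "")
    return codes
```

Called as `extract_field_codes(open("paper.docx", "rb").read())` (file name made up), this would return one string per field. Filtering for the strings that contain `CSL_CITATION` and parsing the JSON that follows would be the next step, assuming that is how the word processor plugin wrote them. An *.odt file could be opened with the same zip call, but its citation markup lives in content.xml and uses different elements, so the loop would need its own variant.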

rmzelle commented 7 years ago

Any particular reason you're interested in supporting *.odt documents?

fbennett commented 7 years ago

No pressing need, was only thinking of completeness and convenience. (To my surprise, the cite-extraction code in Juris-M still works after the migration to 5.0, so that alternative is still open to a LibreOffice user.)

rmzelle commented 7 years ago

> and should work (with a bit of tweaking maybe?) for both *.docx and *.odt source files.

@simonster, can I bug you for a second? I wasn't sure if you still follow zotero-dev, and I have a question about how Zotero citation metadata is embedded in .odt files, which is probably your code. See https://groups.google.com/forum/#!topic/zotero-dev/vImXuhjsFw0