impresso / federal-gazette

Text Importer for TETXML; mapping into canonical format #7

Closed by simon-clematide 3 years ago

simon-clematide commented 5 years ago

Convert TETXML (TETML) files to the impresso canonical format.

schemas for canonical format

The software should match the overall architecture of the impresso text importers for other formats.

For instance, the mets_alto importer:

https://github.com/impresso/impresso-text-acquisition/tree/master/text_importer/importers/mets_alto

aflueckiger commented 5 years ago

Programming the method to_json for NewspaperIssue and NewspaperPage: https://github.com/impresso/impresso-text-acquisition/blob/master/text_importer/importers/classes.py
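As a rough illustration of what such `to_json` methods could look like, here is a minimal sketch. `PageSketch` and `IssueSketch` are hypothetical stand-ins, not the library's `NewspaperPage` and `NewspaperIssue`, and the JSON key names and the example IDs are placeholders; the canonical schemas remain the authority on the actual field names.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class PageSketch:
    """Hypothetical stand-in for a page object (not the library class)."""
    id: str
    number: int
    regions: List[Dict[str, Any]] = field(default_factory=list)

    def to_json(self) -> str:
        # Key names are placeholders, not the canonical schema's keys.
        data = {"id": self.id, "number": self.number, "regions": self.regions}
        return json.dumps(data, sort_keys=True)


@dataclass
class IssueSketch:
    """Hypothetical stand-in for an issue object (not the library class)."""
    id: str
    date: str
    pages: List[PageSketch] = field(default_factory=list)
    content_items: List[Dict[str, Any]] = field(default_factory=list)

    def to_json(self) -> str:
        # The issue refers to its pages by ID; each page is serialized separately.
        data = {
            "id": self.id,
            "date": self.date,
            "pages": [page.id for page in self.pages],
            "content_items": self.content_items,
        }
        return json.dumps(data, sort_keys=True)


# Example usage with made-up identifiers.
page = PageSketch(id="example-1849-01-01-a-p0001", number=1)
issue = IssueSketch(id="example-1849-01-01-a", date="1849-01-01", pages=[page])
print(issue.to_json())
print(page.to_json())
```

Serializing the issue and each page separately is an assumption made here for simplicity; the real importer may organize its output differently.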

Find actual article boundaries in case they end/start on the same page:

mromanello commented 5 years ago

Hi @aflueckiger! Great to see you are making progress on this. Just a heads-up: the library documentation (RTD-style) should be published by the end of today, and the GH repo will become public. That should help you see how a TETML importer would fit into the overall library (but perhaps that is already clear).

mromanello commented 5 years ago

As promised, the documentation is now live here. The "Writing a new importer" section may be especially useful to you.

Any feedback on the clarity of the docs and possible improvements is very welcome!

aflueckiger commented 5 years ago

Thanks @mromanello. I will have a look at the documentation today and decide how to implement the TETML importer. I will probably start programming next week.

aflueckiger commented 5 years ago

Steps to get TETML import working:

Questions:

Suggestions for the documentation:

Suggestions for the code:

aflueckiger commented 3 years ago

The importers for TETML and FedGaz-TETML are implemented in the following repo: https://github.com/impresso/impresso-text-acquisition/tree/master/text_importer

The FedGaz-specific details are documented in the README of this repo.