okfn-brasil / querido-diario

📰 Diários oficiais brasileiros acessíveis a todos | 📰 Brazilian government gazettes, accessible to everyone.
https://queridodiario.ok.org.br/
MIT License

Parsing files #223

Closed ogecece closed 3 years ago

ogecece commented 4 years ago

Hey,

Following @jvanz's request in #195, this is an issue to discuss parsers to extract information from the downloaded gazette files.

In that PR I made a parser prototype for a spider (PE - São José do Egito) to solve the problem of having multiple cities in a single gazette file: it is able to find and parse only the text corresponding to that specific municipality. I guess it could already operate on other spiders, but that would require a bunch of testing.

Another PR that tried to implement a parser (this time for BA - Feira de Santana) is #106, which takes a pretty ambitious approach of extracting structured data about bidding exemptions and was developed using TDD.

Both approaches depend heavily on regexes, and I don't think there's a solution that avoids them without requiring a large quantity of labeled data.

The prototype I submitted can also use some layout information (bold, italic, position in the page, font, page number, etc.). If those prove to be reliable, they can be used in conjunction with the regexes. This information is provided by poppler's pdftohtml and then formatted into an "easier to parse" XML specific to the purposes of the prototype. The XML can then be consumed by extractors that iterate over the text elements, some extractors can operate on top of other extractors, and so on. Everything is regulated by conditions, which are normal functions that can be programmed imperatively as usual. The rest I tried to make as declarative as possible (thinking it could make implementing a parser easier).
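
To make the conditions-plus-extractors idea concrete, here is a minimal sketch (not the actual prototype code). It assumes the XML was generated with poppler's `pdftohtml -xml`, and the file name and the condition/extractor names are made up for illustration:

```python
# Sketch of declarative extraction over pdftohtml's XML output.
# Assumes something like `pdftohtml -xml gazette.pdf gazette.xml` was run;
# the file name and the condition/extractor names are placeholders.
import xml.etree.ElementTree as ET


def is_bold(element):
    # pdftohtml wraps bold runs in a <b> child of each <text> element
    return element.find("b") is not None


def mentions_city(element, city="SÃO JOSÉ DO EGITO"):
    return city in "".join(element.itertext()).upper()


def extract_between(xml_path, start_condition, stop_condition):
    """Yield the text of every element between a start and a stop condition."""
    root = ET.parse(xml_path).getroot()
    inside = False
    for text_el in root.iter("text"):
        if not inside and start_condition(text_el):
            inside = True
        elif inside and stop_condition(text_el):
            break
        elif inside:
            yield "".join(text_el.itertext())


# Conditions are ordinary functions, so any layout attribute exposed by
# pdftohtml (top, left, font, page number, ...) can be used in them as well.
section = list(
    extract_between(
        "gazette.xml",
        start_condition=lambda el: is_bold(el) and mentions_city(el),
        stop_condition=lambda el: is_bold(el) and not mentions_city(el),
    )
)
```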

I think stuff like OCR could be added too, to make paragraph detection easier (PDFs are a mess) and to add more layout information.

Anyway, this prototype is trying to start (or restart) a discussion about how we could parse stuff here. It can be modified in its entirety. Thrown into the garbage can. Improved (docstrings and tests :100:, encodings and other bugs :sob:). But if it sparks meaningful discussion, its purpose is complete.

Wishing you all the best!

rennerocha commented 4 years ago

Parsing file content is something I have wanted since I started contributing to this project, and it is not that easy, as the gazettes don't follow any pattern (and are in PDF, which is a nightmare to get structured content from).

@jvanz Do you think it would be worthwhile to create a new project repository and place all code related to parsing there? In my opinion this could make it easier for new contributors who want to focus only on scraping and/or parsing, and we would remove the dependency between these tasks (for the parsing, it won't matter how the gazette was downloaded).

jvanz commented 4 years ago

> @jvanz Do you think it would be worthwhile to create a new project repository and place all code related to parsing there? In my opinion this could make it easier for new contributors who want to focus only on scraping and/or parsing, and we would remove the dependency between these tasks (for the parsing, it won't matter how the gazette was downloaded).

Yes, it would. This is on my TODO list while working on the Querido Diário API. As you already mentioned, it makes sense to split the crawler from the data parsing and manipulation. This makes even more sense if we think in the long term: we will need a data processing pipeline (outside the Scrapy pipeline). OKBR is also talking to some people to help with the data processing step, and if we get some algorithm from them we will need to add it.

jvanz commented 4 years ago

https://github.com/okfn-brasil/querido-diario-data-processing

ogecece commented 4 years ago

Nice! A good starting point could be moving the text extraction part of this project there?

I could do that, before making other suggestions like the one that was in #195.

ogecece commented 4 years ago

> Parsing file content is something I have wanted since I started contributing to this project, and it is not that easy, as the gazettes don't follow any pattern (and are in PDF, which is a nightmare to get structured content from).

That's what I thought too :/ I'm thinking a lot about how to make the development of these parsers easy and flexible, to accommodate this variability.

alexandrevicenzi commented 4 years ago

I was looking at the current parsers for PDF/DOC to text, and sometimes they do not produce nice or even usable output. Tika seems better than plain PDF-to-text conversion.
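
For reference, a minimal example of what the Tika route can look like, assuming the `tika` Python package is installed (it needs Java and starts a local Tika server on first use); the file name below is just a placeholder:

```python
# Minimal text extraction with Apache Tika via the `tika` Python package.
from tika import parser

parsed = parser.from_file("gazette.pdf")   # placeholder file name
text = parsed["content"]                   # extracted plain text (may be None)
metadata = parsed["metadata"]              # e.g. content type, number of pages
print(text[:500] if text else "no text extracted")
```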

LaTeX does provide structure, but I'm not sure how well converting PDF/DOC to LaTeX would work. For DOC and DOCX I think it would produce good results, but I'm not so sure about PDFs.

For the scenario where a Diário has multiple cities, it could work since each part should have a new header stating the city's name. But if you want to dig further and get more details from its content, you have two options: structure the data so it can be read in a structured format, or do some magic with NLP and/or another ML method to learn from past Diários and identify the sections. That is only as good as the documents following the same layout most of the time.

Instead of thinking big and trying to solve a problem that we do not have yet, we could just figure out a way of splitting a Diário by city, make a lib, and in the future someone could figure out a better way and create lib v2. Create something simple and dumb that works most of the time, just to get a few more cities on board, even if not 100%.

alexandrevicenzi commented 4 years ago

So, I was thinking about it, and it's kind of possible to create a custom PDF parser, or even DOC and DOCX parsers, that would parse the files in a way we could customize. I'm not sure which Python libraries would allow me to do what I'm thinking, and I'm not even sure it needs to be Python at all.

PDF, DOC and DOCX all follow standards and have structure; that's how Evince, LibreOffice and others can parse and display them well formatted. What is missing in these tools is a way to hook things up.

I've been thinking about a custom parser where we could hook actions once we find something. For example, found a header? Hook something to look into the header and extract the text; if it is "PREFEITURA MUNICIPAL DE CARACARAÍ", it should start a new file for another city. Found an image? We can run some OCR on it. Found a table? We can format it as Markdown, and so on.

If you look at SIGPub PDFs from the associations, they all look very much alike and share the same PDF structure, which means that parsing them and trying to extract things in a certain way would work most of the time.

With a custom PDF parser that allows us to hook actions, we can easily create a SIGPubPDFParser, for example. The format changed? Not a big problem. There's another PDF layout to be parsed? Create another class.
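
To illustrate, a rough sketch (not working project code) of what such a hook-based parser could look like, using pdfminer.six to walk the PDF layout; the class, hook and file names are hypothetical:

```python
# Sketch of a hook-based parser: register (condition, action) pairs and fire
# the first matching action for each text block found in the PDF layout.
# Assumes pdfminer.six; names like HookedPDFParser are hypothetical.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer


class HookedPDFParser:
    def __init__(self):
        self.hooks = []  # list of (condition, action) pairs

    def on(self, condition, action):
        self.hooks.append((condition, action))

    def parse(self, path):
        for page in extract_pages(path):
            for element in page:
                if not isinstance(element, LTTextContainer):
                    continue  # images/tables could get their own hooks here
                text = element.get_text().strip()
                for condition, action in self.hooks:
                    if condition(text):
                        action(text)
                        break


# Example: split a SIGPub-style gazette into one section per municipality.
sections = {}                 # city header -> list of text blocks
state = {"city": None}

def is_city_header(text):
    return text.upper().startswith("PREFEITURA MUNICIPAL DE")

def start_city(text):
    state["city"] = text
    sections[text] = []

def collect(text):
    if state["city"] is not None:
        sections[state["city"]].append(text)

parser = HookedPDFParser()
parser.on(is_city_header, start_city)
parser.on(lambda text: True, collect)  # fallback: everything else
parser.parse("gazette.pdf")            # placeholder file name
```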

What would be more interesting is if it could take some sort of template as input and know what to do by looking at that template, rather than by coding things manually.

I'm not sure if there is any tool that would allow us to do such things.

WDYT? Is it a good idea? Should we research more about this? Is there any tool that allows us to do this?

jvanz commented 4 years ago

> Nice! A good starting point could be moving the text extraction part of this project there?
>
> I could do that, before making other suggestions like the one that was in #195.

Yes, that's the initial plan. I would have liked to do that last weekend, but I had to work during that period. If you can do that, awesome. I'm thinking about how to make it run in production. For now, you can create a directory in the repo (e.g. data extraction) and add a function which expects a file descriptor or a file path and returns the text.
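
Something along these lines, just to pin down the shape of the function (the directory, function name and extraction backend are assumptions; pdfminer.six is used here, but Tika or pdftotext would fit the same interface):

```python
# data_extraction/__init__.py (hypothetical location)
# Returns the plain text of a gazette, given a file path or an open
# binary file object, both of which pdfminer's extract_text accepts.
from pdfminer.high_level import extract_text


def get_gazette_text(source):
    return extract_text(source)


# Usage:
# text = get_gazette_text("gazette.pdf")
# with open("gazette.pdf", "rb") as fd:
#     text = get_gazette_text(fd)
```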

ogecece commented 4 years ago

> For the scenario where a Diário has multiple cities, it could work since each part should have a new header stating the city's name. But if you want to dig further and get more details from its content, you have two options: structure the data so it can be read in a structured format, or do some magic with NLP and/or another ML method to learn from past Diários and identify the sections. That is only as good as the documents following the same layout most of the time.
>
> Instead of thinking big and trying to solve a problem that we do not have yet, we could just figure out a way of splitting a Diário by city, make a lib, and in the future someone could figure out a better way and create lib v2. Create something simple and dumb that works most of the time, just to get a few more cities on board, even if not 100%.

That's the same line of thought I'm at right now! :)

While trying to solve the problem of multiple cities in the same document for #195, I actually built a parser prototype that gets only the text content related to the city of São José do Egito - PE from the documents. I removed it from the PR, and since we are moving data processing to another repo, we can continue working on that later with a new architecture. If you are interested in this parser, take a look at the `parser` branch in the fork and run the `pe_sao_jose_do_egito` spider.

I think it is quite similar to your ideas :)

ogecece commented 3 years ago

I'm closing this because it doesn't relate to this repo anymore. Refer to querido-diario-data-processing or the equivalent Text Processing unit of Querido Diário.