okfn-brasil / querido-diario

📰 Diários oficiais brasileiros acessíveis a todos | 📰 Brazilian government gazettes, accessible to everyone.
https://queridodiario.ok.org.br/
MIT License

Add possibility to store information of more than one file per-gazette #313

Closed · rennerocha closed this 4 years ago

rennerocha commented 4 years ago

In a gazette item, we consider that each gazette is a unique file, so we store only the information of one file.

This is true most of the time, but it is possible that the same gazette is split into more than one file, as in Piracicaba-SP, where some gazettes come in two PDF files (each file has a maximum of 50 pages).

So we may lose important information that will be necessary when we process the downloaded files and extract their content.

ogecece commented 4 years ago

:/

What about an order_in_sequence integer field? Just spilling thoughts right now.

If we can get the order from the system (pages, chunks, etc.), great. When processing, we can concatenate in the ascending order that the spider extracted.

If we can't, well, there's not much to do in the spider and we try to solve it when processing. Maybe case by case.
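
A minimal sketch of that concatenation idea, assuming each downloaded file ends up as an (order_in_sequence, extracted_text) pair (all names here are hypothetical):

def merge_gazette_texts(parts):
    # Concatenate the text fragments in the ascending order
    # the spider extracted them (order_in_sequence).
    ordered = sorted(parts, key=lambda part: part[0])
    return "\n".join(text for _, text in ordered)

# Fragments may arrive out of order from the download step.
full_text = merge_gazette_texts([(2, "pages 51-100"), (1, "pages 1-50")])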

rennerocha commented 4 years ago

This is not really a problem in the spider. We already pass a list of URLs in the file_urls field and Scrapy will download every file from there. Usually we have only one file, but it seems that a few cities split their gazettes into more than one file.
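
Just to illustrate (this is a sketch, not code from the repository), the spider side can already emit a single item carrying both PDFs; the import path, URLs, and territory_id below are assumptions/placeholders:

import datetime

from gazette.items import Gazette  # assumed import path for the project's item

class PiracicabaSketchSpider:  # stands in for the real spider class
    def parse(self, response):
        # One gazette edition split into two PDFs: a single item, two file_urls.
        yield Gazette(
            date=datetime.date(2020, 10, 8),
            file_urls=[
                "https://diariooficial.piracicaba.sp.gov.br/part-1.pdf",  # placeholder
                "https://diariooficial.piracicaba.sp.gov.br/part-2.pdf",  # placeholder
            ],
            territory_id="3538709",  # assumed IBGE code for Piracicaba-SP
            power="executive",
        )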

The problem will appear when accessing the database table. This will be required by the processing step, when we extract the text from the downloaded files. In my opinion, the way to solve it is to change the database structure so we can store the information of as many downloaded files as we have.

Something like:

from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import relationship

class Gazette(DeclarativeBase):
    __tablename__ = "gazettes"
    id = Column(Integer, primary_key=True)
    files = relationship("GazetteFile", back_populates="gazette")

class GazetteFile(DeclarativeBase):
    __tablename__ = "gazettefiles"
    id = Column(Integer, primary_key=True)
    gazette_id = Column(Integer, ForeignKey("gazettes.id"))
    sequence = Column(Integer)  # This acts like the order_in_sequence you suggested
    checksum = Column(String)
    path = Column(String)
    url = Column(String)
    gazette = relationship("Gazette", back_populates="files")

And then update SQLDatabasePipeline to deal with this new structure (considering that we should not add duplicated entries if the checksum is the same), etc.
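
A rough sketch of that pipeline change, assuming the models above and that Scrapy's FilesPipeline has already filled item["files"] with url/path/checksum entries (self.Session is an assumed sessionmaker):

class SQLDatabasePipeline:  # only the relevant method is sketched here
    def process_item(self, item, spider):
        session = self.Session()
        gazette = Gazette()
        for sequence, downloaded in enumerate(item.get("files", [])):
            duplicate = (
                session.query(GazetteFile)
                .filter_by(checksum=downloaded["checksum"])
                .first()
            )
            if duplicate is not None:
                continue  # do not add duplicated entries with the same checksum
            gazette.files.append(
                GazetteFile(
                    sequence=sequence,
                    checksum=downloaded["checksum"],
                    path=downloaded["path"],
                    url=downloaded["url"],
                )
            )
        session.add(gazette)
        session.commit()
        return item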

ogecece commented 4 years ago

That's what I was thinking :)

jvanz commented 4 years ago

Is another table really necessary? In the current gazette table there is a constraint which considers the file checksum. So, in theory, we can have the same territory_id and date but with different checksums. I did not dive deep into the Piracicaba-SP spider, but if the split is just because of the size of the file, I think we can have multiple files in the current gazette table. Does it make sense to you? Am I missing something? Should we add the edition_number to the constraint? Just sharing what I have in mind right now. I need to double-check that later, especially the corner cases.
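
For reference, a sketch of that single-table idea; the column names are an assumption about the current schema, not a copy of it:

from sqlalchemy import Column, Date, Integer, String, UniqueConstraint

class Gazette(DeclarativeBase):
    __tablename__ = "gazettes"
    # One row per downloaded file; the constraint rejects exact duplicates
    # but still allows several files for the same territory and date.
    __table_args__ = (UniqueConstraint("territory_id", "date", "file_checksum"),)
    id = Column(Integer, primary_key=True)
    territory_id = Column(String)
    date = Column(Date)
    file_checksum = Column(String)
    file_path = Column(String)
    file_url = Column(String)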

rennerocha commented 4 years ago

For example, this gazette: https://diariooficial.piracicaba.sp.gov.br/2020/10/08/. We have one file for the first 30 pages and a second file for the last 35 pages. They are the same edition (by the way, we are unable to get the edition number in the spider). Both need to be downloaded, and we need to store the checksums of both files.

A second table was my first thought (just to open the discussion), but it is not necessarily the only or the best solution. Considering the constraints we have in the Gazette model (as you mentioned), it should be possible to store more than one entry for the same date (actually, we are already doing this when we have extra editions for the same date). This solution requires just small changes in SQLDatabasePipeline.
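
Under that single-table reading, the pipeline change could be as small as this sketch (column names mirror the constraint sketch above; self.Session is an assumed sessionmaker):

class SQLDatabasePipeline:  # only the relevant method is sketched here
    def process_item(self, item, spider):
        session = self.Session()
        # One Gazette row per downloaded file, sharing territory_id and date;
        # the unique constraint rejects exact duplicates at insert time.
        for downloaded in item.get("files", []):
            session.add(
                Gazette(
                    territory_id=item["territory_id"],
                    date=item["date"],
                    file_checksum=downloaded["checksum"],
                    file_path=downloaded["path"],
                    file_url=downloaded["url"],
                )
            )
        session.commit()
        return item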

I think this would be more straightforward to work with, and if in the future we need a second table, it will not be complicated to create a migration for the existing data.

jvanz commented 4 years ago

> I think this would be more straightforward to work with, and if in the future we need a second table, it will not be complicated to create a migration for the existing data.

Agree. Furthermore, there are spiders which already return multiple files for the same territory and date. Take a look at this example from the FECAM website: there are multiple files for the same date and city, with a single file for each publication. In this example, the spider returns an item for each file. That's why we have not faced this issue before.

I'm just afraid that I'm forgetting something or that I don't know how to organize the gazettes properly. From the text extraction perspective, this does not matter too much; it just needs a file location. My concern is when we want to extract more useful information from that text. But I think we can fix that in the future if necessary.