scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

GSoC 2021: Feeds enhancements #4963

Closed: ejulio closed this issue 1 year ago

ejulio commented 3 years ago

This is a single issue to discuss feeds enhancements as a project for GSoC 2021.

My idea here is a project that works on three (or more) improvements, detailed below.

1 - Filter items on export based on custom rules

Relevant issues:

There is already a PR for this one (note my last comment there): https://github.com/scrapy/scrapy/pull/4576

However, if the author doesn't reply in time, we can continue the work from that branch and finish the feature.

2 - Compress feeds

This is an old feature request and there's an issue for it here: https://github.com/scrapy/scrapy/issues/2174. The API has changed a bit since then, but I think it'd be something like:

FEEDS = {
    "myfile.jl": {
        "compression": "gzip"
    }
}

I think gzip is a good starting point, but we should put some effort into designing an API that is extensible and allows different formats.
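As a rough sketch of how an extensible version might look (everything beyond the gzip value is an assumption for illustration; the compression key itself does not exist in Scrapy at this point):

FEEDS = {
    "items.jl": {
        "format": "jsonlines",
        "compression": "gzip",  # built-in starting point
    },
    "items.csv": {
        "format": "csv",
        # a dotted path could let users plug in their own compressors later
        "compression": "myproject.compressors.ZstdCompressor",
    },
}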

3 - Spider open/close a batch

Recently we added support for batch delivery in Scrapy: say, every X items, we deliver a file and open a new one. Sometimes we don't know the threshold upfront, or it may be based on an external signal. In this case, we should be able to trigger a batch delivery from the spider. I have two possible ideas for it:

Note that this can be tricky, as we allow multiple feeds (so it may require an argument specifying which feed's batch to close).

Bhavesh0327 commented 3 years ago

Hey @ejulio @Gallaecio, I am a 3rd-year student from NITT, India. I am planning to work on this project as part of GSoC this year. Currently, I am looking into the codebase and trying to understand the different functionalities of the code. So far, I have looked into the conversations in the issues mentioned above and the attached PRs. It would be really helpful if you could provide some suggestions on how to approach the issues.

Gallaecio commented 3 years ago

It would be really helpful if you could provide some suggestions on how to approach the issues.

Your question is quite open, so I’m not sure exactly how to help, but in general terms, if what you are looking for is help on how to start addressing these issues, or where in the code to start, I would suggest looking at the pull requests related to feed exports and the FEEDS setting that were closed in the last few Scrapy releases. They should be easy to find from the release notes.

Bhavesh0327 commented 3 years ago

Thanks, yes, my question was about where in the code I should look. I will go through the release notes and the related PRs.

drs-11 commented 3 years ago

Hi! I'm sorry I'm starting pretty late, but is it mandatory to have a fresh PR as part of the pre-submission task? I already made some earlier; could those be counted?

Earlier PRs: https://github.com/scrapy/scrapy/pull/4752 https://github.com/scrapy/scrapy/pull/4753 https://github.com/scrapy/scrapy/pull/4778

drs-11 commented 3 years ago

Hey @ejulio, wouldn't it make more sense to let Items carry some metadata such as acceptance_criteria, storage_params, or method_filter, rather than passing such filtering criteria through global settings? Then FeedExporter could filter each Item and handle it appropriately.
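A minimal sketch of that idea (the acceptance_criteria key is hypothetical Field metadata that a feed exporter would have to be taught to read; Scrapy itself ignores unknown Field metadata):

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    # Hypothetical metadata key: a feed exporter could look it up to decide
    # whether to export a given item.
    price = scrapy.Field(acceptance_criteria=lambda value: value > 0)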

Bhavesh0327 commented 3 years ago

Hi @drs-11, one such PR is already linked to the issue, where item_classes is used in the item exporter to filter items out, but as mentioned in the issue I think it will be better to implement this in the feed export itself, so that we can later use it to add more complex filters if needed.

Bhavesh0327 commented 3 years ago

Also, @ejulio, I am almost done with an outline of my draft proposal for the project, but I have a few points of confusion. Can you please explain the third feature, i.e. spider close a batch, in a little more detail? I think I am missing something about exactly how you expect this feature to work.

drs-11 commented 3 years ago

I understand @Bhavesh0327. I was proposing something else.

By pt 3, I think they mean closing and starting a new batch via a custom signal generated by the user, or a custom user-defined method invoked when a certain condition is satisfied, so that batch creation is not limited to just an item-count threshold.

Bhavesh0327 commented 3 years ago

Yes, I meant to ask what kinds of conditions there can be, because that can change how the feature is actually implemented.

drs-11 commented 3 years ago

Now that I think about it, I get your confusion @Bhavesh0327. So far I can only think of 3 such cases:

1. After every x items (already implemented)
2. After every x minutes
3. After every x bytes

Expanding on @ejulio's idea, we can use signals to trigger batch deliveries via a separate method in FeedExporter. The above three criteria can be built-ins. The signal trigger could be overridden by the user with a custom method in their Spider. The question is where the signal should originate from so that the trigger criteria can be made extensible.
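For illustration, such criteria might sit next to the existing batch_item_count option in FEEDS; the time and size keys below are hypothetical names, only batch_item_count exists today:

FEEDS = {
    "s3://mybucket/items-%(batch_id)d.jl": {
        "format": "jsonlines",
        "batch_item_count": 10000,         # existing: close a batch every N items
        "batch_time_limit": 15 * 60,       # hypothetical: close a batch every 15 minutes
        "batch_size_limit": 50 * 1024**2,  # hypothetical: close a batch every ~50 MiB
    },
}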

drs-11 commented 3 years ago

Hey @Gallaecio, apologies for being impatient, but could you address the above concerns so we can adjust our proposals? The GSoC deadline is approaching and the designated mentor seems to be unavailable for the time being.

Bhavesh0327 commented 3 years ago

Hey @ejulio @Gallaecio, I have a small doubt: in the case of batch deliveries, how are we supposed to manage compression? Either we shall

In the second case, for compression like gzip, we would have to tar all those files into one and then go from there.

Also thanks @drs-11, that pretty much clears my doubt.

Gallaecio commented 3 years ago

is it mandatory to have a fresh PR as a part of pre-submission task?

Prior pull requests are OK.

Gallaecio commented 3 years ago

Let’s see if I can address your questions:

Sorry for the delay on the feedback. From now until the proposal deadline I hope to be available on a daily basis (Mon-Fri).

Bhavesh0327 commented 3 years ago

Thanks for the feedback @Gallaecio, I have made the necessary changes and submitted my draft proposal for this project. It would be really helpful if you, @ejulio, and other mentors could provide feedback on it as well.

https://docs.google.com/document/d/1mUKG4_OOIRl7MNjSYPr3hr7026j2Ws3tV810lDD13bo/edit?usp=sharing

ejulio commented 3 years ago

Hey everyone. Sorry for the delay here. I was on a strict schedule last week and I only had time to catch up today.

  1. Regarding the compression, as @Gallaecio mentioned, it should be per file. When we write the file to disk, it should be compressed. It shouldn't be a compression of multiple batches.

  2. On signals. My idea there is that sometimes we don't have a specific threshold, say every 10k items. Maybe we want to close a batch once we finish scraping a category, for example, assuming we scrape categories sequentially. As we don't know the time/size of the category in advance, it can be tricky to do so, especially because we don't have direct access to the exporter from the spider. So the idea is to create this connection. It could be a function, but where? It could be a signal, which is probably better in terms of decoupling.

drs-11 commented 3 years ago

it can be tricky to do so, especially because we don't have direct access to the exporter from the spider

Exactly what I've been wondering. I've been thinking of pluggable methods for this, but the interface exposed to the user (i.e. settings.py, pipelines.py, etc.) doesn't really provide an easy way to pass methods to FeedExporter.

Bhavesh0327 commented 3 years ago

Thanks, @ejulio, that pretty much clears it up. Although, in my opinion, the time/size criteria could also be tried in addition to the signal approach; they might have some good use cases.

drs-11 commented 3 years ago

@ejulio, @Gallaecio I wanted some feedback as well on the draft proposal I made. How do you suggest I present it? I wrote it in Markdown, so it can be inconvenient to comment on. Should I transfer it to Google Docs?

Gallaecio commented 3 years ago

@drs-11 You could upload it as markdown to a GitHub repository, for easy commenting. If you prefer Google Docs, though, converting it to HTML and copy-pasting from your web browser into a Google Docs document should do the job.

Gallaecio commented 3 years ago

@Bhavesh0327 About your proposal:

Bhavesh0327 commented 3 years ago

Thanks for the feedback @Gallaecio, I will look into all those points you suggested, and make the improvements asap.

drs-11 commented 3 years ago

Here's my draft proposal. Please take a look at mine as well, @Gallaecio, @ejulio. I have only filled in the technical details for now; I will fill in the rest soon.

Gallaecio commented 3 years ago

@drs-11 Some feedback:

drs-11 commented 3 years ago

Thanks!

drs-11 commented 3 years ago

@Gallaecio do you disapprove of the additional methods I proposed for criteria checking in the Item and Field classes? I thought they would solve the flexibility issue, though I was not sure it's a good design.

Gallaecio commented 3 years ago

They address the specific example I discussed (price == 0, price > 0), but that was just an example. My point is that some users may need complex logic to determine whether an item goes into a feed or not, and instead of coming up with a way to express complex logic in JSON, it would be best to make it possible to write actual code for the scenarios where users need it.
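To make that concrete, a sketch of what "actual code" could mean here; the accepts() method and the item_filter feed option are assumptions for illustration, not an existing API:

# Hypothetical filter component, e.g. living in myproject/filters.py,
# referenced from the FEEDS setting by its dotted path.
class DiscountedProductsFilter:
    def accepts(self, item):
        # Arbitrary Python logic that would be awkward to express in JSON.
        return item.get("price", 0) > 0 and item.get("discount_percent", 0) >= 20

FEEDS = {
    "discounted.jl": {
        "format": "jsonlines",
        "item_filter": "myproject.filters.DiscountedProductsFilter",  # hypothetical key
    },
}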

drs-11 commented 3 years ago

Ok, yeah, that makes sense; a JSON interface would be pretty limited. Thanks!

ejulio commented 3 years ago

hey @drs-11 @Bhavesh0327

  1. I don't see the need for archiving. If users want a single file, they simply don't set a batching criterion. If they want batches, they probably don't want to aggregate the batches, since they want the data delivered ASAP.

  2. My take on the custom batch size is: say that I'm scraping 10 categories from Amazon and I want each category to be delivered in a single file as soon as all its products have been collected. I can write a routing strategy with filters and then, in the spider, I need to send a "message" to the exporter saying that a given batch/file can be closed. The easiest way I see to do that is by having FeedExporter listen to a close_batch signal (which we need to create) and trigger it by itself whenever a batch is closed. The thing is that, with this signal in place, the spider can trigger it too: https://docs.scrapy.org/en/latest/topics/api.html#scrapy.signalmanager.SignalManager.send_catch_log_deferred

@drs-11, I liked your approach to registering the triggers. We just need to figure out how it would work in practice.
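A sketch of how the spider side could look; the close_batch signal object, its feed_uri argument, and the FeedExporter handler it assumes are all part of the proposal, not an existing Scrapy API:

import scrapy

close_batch = object()  # Scrapy signals are plain sentinel objects

class CategorySpider(scrapy.Spider):
    name = "categories"

    def parse_category(self, response):
        for product in response.css(".product"):
            yield {"name": product.css("::attr(title)").get()}
        if not response.css("a.next-page"):
            # Last page of this category: ask the (hypothetical) FeedExporter
            # handler connected to close_batch to rotate the current batch.
            self.crawler.signals.send_catch_log_deferred(
                signal=close_batch,
                feed_uri="s3://mybucket/%(batch_id)d.jl",  # which feed's batch to close
            )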

drs-11 commented 3 years ago

@Gallaecio I've been thinking about "post-processing" as a feature. The way it would go: export all items to the target file -> run post-processing components on the file -> overwrite the file with the processed data.

But a problem arises for compression, as no in-memory compression can take place. We would have to use a temporary file to store the compressed data and then use that data to overwrite the target file. So if the target file happens to be big, there will be some heavy I/O usage. Will that inefficiency be a big problem?

drs-11 commented 3 years ago

As @Gallaecio mentioned: minifying, beautification, compression. Apart from these, I can think of some sort of report generation, perhaps?

drs-11 commented 3 years ago

@Gallaecio, @ejulio I updated the proposal. Another round of feedback, please!

Gallaecio commented 3 years ago

I've been thinking about "post-processing" as a feature. The way it would go: export all items to the target file -> run post-processing components on the file -> overwrite the file with the processed data.

I’m not sure that’s the way it should go. Some post-processing would not need to do that. In fact, you mention compression as something that cannot happen in memory, but I think some (most?) compression algorithms do not need the whole file as input and can compress during data streaming. I’m assuming that Python’s gzip file-like object writes to disk as you write data into it, rather than everything at once when you close the file. I may be wrong, though.
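For reference, the standard library's gzip.GzipFile does allow that kind of streaming: compressed chunks are written to the underlying file as they are produced, so the whole feed never has to sit in memory (a plain local-file illustration, not Scrapy's storage code):

import gzip

with open("items.jl.gz", "wb") as raw, gzip.GzipFile(fileobj=raw, mode="wb") as gz:
    # Each exported line is compressed incrementally as it is written.
    for line in (b'{"id": 1}\n', b'{"id": 2}\n'):
        gz.write(line)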

For cases where an intermediary file is really needed, I would not have the post-processing component overwrite the output file at the end. I think the target file should only be written once, to not complicate things when remote storage is involved (S3, GCS, FTP) or there are options like overwrite to take into account. Instead, I would let the post-processing component handle that on its own: if it needs an intermediary, temporary file, it can create one on its own on the local disk, and then read from it into the ultimate target file at the end.

As @Gallaecio mentioned: minifying, beautification, compression. Apart from these, I can think of some sort of report generation, perhaps?

If it’s a report about items, I think item pipelines are more appropriate for such a feature. And I’m not sure this is something we should provide built-in in Scrapy. Maybe it makes more sense as a Spidermon feature, I don’t know.

If it’s a report about the post-processing instead (e.g. how effective compression was), I’m guessing the specific post-processing implementation could log some minimal stats as INFO, but I would not put much effort on it.


Bhavesh0327 commented 3 years ago

Hi @Gallaecio, I have one last small doubt left for my proposal. I was planning to add the multi-part upload feature for S3 storage. I went through the PRs and issues linked with it, which describe how rewriting the existing patch on top of boto3 would remove the existing errors. Can you please provide a little more input on what you are expecting from this patch?

Gallaecio commented 3 years ago

Can you please provide a little more input on what you are expecting from this patch?

I would summarize it as rewriting S3FeedStorage using boto3’s upload_fileobj method, which automatically uses multi-part support for big files.
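Roughly, the upload side would then look something like this (a sketch of the boto3 call only, not the actual S3FeedStorage rewrite; bucket and key names are placeholders):

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")  # credentials resolved through the usual boto3 chain
# upload_fileobj switches to multi-part uploads automatically above this threshold
config = TransferConfig(multipart_threshold=8 * 1024**2)

with open("items-batch-1.jl", "rb") as feed_file:
    s3.upload_fileobj(feed_file, "my-feeds-bucket", "feeds/items-batch-1.jl", Config=config)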