scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

GSoC 2021: Feeds enhancements #4963

Closed: ejulio closed this issue 1 year ago

ejulio commented 3 years ago

This is a single issue to discuss feeds enhancements as a project for GSoC 2021.

My idea here is a project that works on three (or more) improvements, detailed below.

1 - Filter items on export based on custom rules

Relevant issues:

There is already a PR for this one (note my last comment there): https://github.com/scrapy/scrapy/pull/4576

However, if the author doesn't reply in time, we can continue the work from that branch and finish the feature.

2 - Compress feeds

This is an old feature request and there's an issue for it here: https://github.com/scrapy/scrapy/issues/2174. The API has changed a bit since then, but I think it'd be something like:

FEEDS = {
    "myfile.jl": {
        "compression": "gzip"
    }
}

I think gzip is a good starting point, but we should put some effort into designing an API that is extensible and allows different formats.
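As a rough sketch of how an extensible version might look (everything beyond the gzip value is an assumption for illustration; the compression key itself does not exist in Scrapy at this point):

FEEDS = {
    "items.jl": {
        "format": "jsonlines",
        "compression": "gzip",  # built-in starting point
    },
    "items.csv": {
        "format": "csv",
        # a dotted path could let users plug in their own compressors later
        "compression": "myproject.compressors.ZstdCompressor",
    },
}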

3 - Spider open/close a batch

Recently we added support for batch delivery in Scrapy: say, every X items, we deliver a file and open a new one. Sometimes we don't know the threshold upfront, or it may be based on an external signal. In this case, we should be able to trigger a batch delivery from the spider. I have two possible ideas for it:

Note that this can be tricky, as we allow multiple feeds (so it may require an argument specifying which feed's batch to close).

Bhavesh0327 commented 3 years ago

Hey @ejulio @Gallaecio, I am a 3rd-year student from NITT, India. I am planning to work on this project as part of GSoC this year. Currently, I am looking into the codebase and trying to understand the different functionalities of the code. So far, I have looked into the conversations in the issues mentioned above and the attached PRs. It would be really helpful if you could provide some suggestions on how to approach the issues.

Gallaecio commented 3 years ago

It would be really helpful if you could provide some suggestions on how to approach the issues.

Your question is quite open, so I’m not sure exactly how to help, but in general terms, if what you are looking for is help on how to start addressing these issues, or where in the code to start, I would suggest looking at the pull requests related to feed exports and the FEEDS setting that were closed in the last few Scrapy releases. They should be easy to find from the release notes.

Bhavesh0327 commented 3 years ago

Thanks, yes, my question was about where in the code I should look. I will go through the release notes and the related PRs.

drs-11 commented 3 years ago

Hi! I'm sorry I'm starting pretty late, but is it mandatory to have a fresh PR as part of the pre-submission task? I already made some earlier; could those be counted?

Earlier PRs: https://github.com/scrapy/scrapy/pull/4752 https://github.com/scrapy/scrapy/pull/4753 https://github.com/scrapy/scrapy/pull/4778

drs-11 commented 3 years ago

Hey @ejulio, wouldn't it make more sense to let Items carry some metadata such as acceptance_criteria, storage_params, or method_filter, rather than passing such filtering criteria through global settings? Then FeedExporter could filter each Item and handle it appropriately.
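A minimal sketch of that idea (the acceptance_criteria key is hypothetical Field metadata that a feed exporter would have to be taught to read; Scrapy itself ignores unknown Field metadata):

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    # Hypothetical metadata key: a feed exporter could look it up to decide
    # whether to export a given item.
    price = scrapy.Field(acceptance_criteria=lambda value: value > 0)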

Bhavesh0327 commented 3 years ago

Hi @drs-11, one such PR is already linked to the issue, where item_classes is used in the item exporter to filter items out, but as mentioned in the issue I think it will be better to implement this in the feed export itself, so that we can later use it to add more complex filters if needed.

Bhavesh0327 commented 3 years ago

Also, @ejulio, I am almost done with an outline of my draft proposal for the project, but I have a few points of confusion. Can you please explain the third feature, i.e. spider close a batch, in a little more detail? I think I am missing something about exactly how you expect this feature to work.

drs-11 commented 3 years ago

I understand @Bhavesh0327. I was proposing something else.

By pt 3, I think they mean closing and starting a new batch via a custom signal generated by the user, or a custom user-defined method invoked when a certain condition is satisfied, so that batch creation is not limited to just an item-count threshold.

Bhavesh0327 commented 3 years ago

Yes, I meant to ask what kinds of conditions there can be, because that can change how the feature is actually implemented.

drs-11 commented 3 years ago

Now that I think about it, I get your confusion @Bhavesh0327. So far I can only think of 3 such cases:

1. After every x items (already implemented)
2. After every x minutes
3. After every x bytes

Expanding on @ejulio's idea, we can use signals to trigger batch deliveries via a separate method in FeedExporter. The above three criteria can be built-ins. The signal trigger could be overridden by the user with a custom method in their Spider. The question is where the signal should originate from so that the trigger criteria can be made extensible.
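For illustration, such criteria might sit next to the existing batch_item_count option in FEEDS; the time and size keys below are hypothetical names, only batch_item_count exists today:

FEEDS = {
    "s3://mybucket/items-%(batch_id)d.jl": {
        "format": "jsonlines",
        "batch_item_count": 10000,         # existing: close a batch every N items
        "batch_time_limit": 15 * 60,       # hypothetical: close a batch every 15 minutes
        "batch_size_limit": 50 * 1024**2,  # hypothetical: close a batch every ~50 MiB
    },
}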

drs-11 commented 3 years ago

Hey @Gallaecio, apologies for being impatient, but could you address the above concerns so we can adjust our proposals? The GSoC deadline is approaching and the designated mentor seems to be unavailable for the time being.

Bhavesh0327 commented 3 years ago

Hey @ejulio @Gallaecio, I have a small doubt: in the case of batch deliveries, how are we supposed to manage compression? Either we shall

In the second case, for compression like gzip, we would have to tar all those files into one and then go from there.

Also thanks @drs-11, that pretty much clears my doubt.

Gallaecio commented 3 years ago

is it mandatory to have a fresh PR as a part of pre-submission task?

Prior pull requests are OK.

Gallaecio commented 3 years ago

Let’s see if I can address your questions:

Sorry for the delay on the feedback. From now until the proposal deadline I hope to be available on a daily basis (Mon-Fri).

Bhavesh0327 commented 3 years ago

Thanks for the feedback @Gallaecio, I have made the necessary changes and submitted my draft proposal for this project. It would be really helpful if you, @ejulio, and other mentors could provide feedback on it as well.

https://docs.google.com/document/d/1mUKG4_OOIRl7MNjSYPr3hr7026j2Ws3tV810lDD13bo/edit?usp=sharing

ejulio commented 3 years ago

Hey everyone. Sorry for the delay here. I was on a strict schedule last week and I only had time to catch up today.

  1. Regarding the compression, as @Gallaecio mentioned, it should be per file. When we write the file to disk, it should be compressed. It shouldn't be a compression of multiple batches.

  2. On signals. My idea there is that sometimes we don't have a specific threshold, say every 10k items. Maybe we want to close a batch once we finish scraping a category, for example, assuming we scrape categories sequentially. As we don't know the time/size of the category in advance, it can be tricky to do so, especially because we don't have direct access to the exporter from the spider. So the idea is to create this connection. It could be a function, but where? It could be a signal, which is probably better in terms of decoupling.

drs-11 commented 3 years ago

it can be tricky to do so, especially because we don't have direct access to the exporter from the spider

Exactly what I've been wondering. I've been thinking of pluggable methods for this, but the interface exposed to the user (i.e. settings.py, pipelines.py, etc.) doesn't really provide an easy way to pass methods to FeedExporter.

Bhavesh0327 commented 3 years ago

Thanks, @ejulio, that pretty much clears it up. Although, in my opinion, the time/size criteria could also be tried in addition to the signal approach; they might have some good use cases.

drs-11 commented 3 years ago

@ejulio, @Gallaecio I wanted some feedback as well on the draft proposal I made. How do you suggest I present it? I wrote it in Markdown, so it can be inconvenient to comment on. Should I transfer it to Google Docs?

Gallaecio commented 3 years ago

@drs-11 You could upload it as markdown to a GitHub repository, for easy commenting. If you prefer Google Docs, though, converting it to HTML and copy-pasting from your web browser into a Google Docs document should do the job.

Gallaecio commented 3 years ago

@Bhavesh0327 About your proposal:

Bhavesh0327 commented 3 years ago

Thanks for the feedback @Gallaecio, I will look into all those points you suggested, and make the improvements asap.

drs-11 commented 3 years ago

Here's my draft proposal. Please take a look at mine as well, @Gallaecio, @ejulio. I have only filled in the technical details for now; I will fill in the rest soon.

Gallaecio commented 3 years ago

@drs-11 Some feedback:

drs-11 commented 3 years ago

Thanks!

drs-11 commented 3 years ago

@Gallaecio do you disapprove of the additional methods I proposed for criteria checking in the Item and Field classes? I thought they would solve the flexibility issue, though I was not sure it's a good design.

Gallaecio commented 3 years ago

They address the specific example I discussed (price == 0, price > 0), but that was just an example. My point is that some users may need complex logic to determine whether an item goes into a feed or not, and instead of coming up with a way to express complex logic in JSON, it would be best to make it possible to write actual code for the scenarios where users need it.
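To make that concrete, a sketch of what "actual code" could mean here; the accepts() method and the item_filter feed option are assumptions for illustration, not an existing API:

# Hypothetical filter component, e.g. living in myproject/filters.py,
# referenced from the FEEDS setting by its dotted path.
class DiscountedProductsFilter:
    def accepts(self, item):
        # Arbitrary Python logic that would be awkward to express in JSON.
        return item.get("price", 0) > 0 and item.get("discount_percent", 0) >= 20

FEEDS = {
    "discounted.jl": {
        "format": "jsonlines",
        "item_filter": "myproject.filters.DiscountedProductsFilter",  # hypothetical key
    },
}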

drs-11 commented 3 years ago

Ok, yeah, that makes sense; a JSON interface would be pretty limited. Thanks!

ejulio commented 3 years ago

hey @drs-11 @Bhavesh0327

  1. I don't see the need for archiving. If users want a single file, they simply don't set a batching criterion. If they want batches, they probably don't want to aggregate the batches, since they want the data delivered ASAP.

  2. My take on the custom batch size is: say that I'm scraping 10 categories from Amazon and I want each category to be delivered in a single file as soon as all its products have been collected. I can write a routing strategy with filters and then, in the spider, I need to send a "message" to the exporter saying that a given batch/file can be closed. The easiest way I see to do that is by having FeedExporter listen to a close_batch signal (which we need to create) and trigger it by itself whenever a batch is closed. The thing is that, with this signal in place, the spider can trigger it too: https://docs.scrapy.org/en/latest/topics/api.html#scrapy.signalmanager.SignalManager.send_catch_log_deferred

@drs-11, I liked your approach to registering the triggers. We just need to figure out how it would work in practice.
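A sketch of how the spider side could look; the close_batch signal object, its feed_uri argument, and the FeedExporter handler it assumes are all part of the proposal, not an existing Scrapy API:

import scrapy

close_batch = object()  # Scrapy signals are plain sentinel objects

class CategorySpider(scrapy.Spider):
    name = "categories"

    def parse_category(self, response):
        for product in response.css(".product"):
            yield {"name": product.css("::attr(title)").get()}
        if not response.css("a.next-page"):
            # Last page of this category: ask the (hypothetical) FeedExporter
            # handler connected to close_batch to rotate the current batch.
            self.crawler.signals.send_catch_log_deferred(
                signal=close_batch,
                feed_uri="s3://mybucket/%(batch_id)d.jl",  # which feed's batch to close
            )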

drs-11 commented 3 years ago

@Gallaecio I've been thinking about "post-processing" as a feature. The way it would go: export all items to the target file -> run post-processing components on the file -> overwrite the file with the processed data.

But a problem arises for compression, as no in-memory compression can take place. We would have to use a temporary file to store the compressed data and then use that data to overwrite the target file. So if the target file happens to be big, there will be some heavy I/O usage. Will that inefficiency be a big problem?

drs-11 commented 3 years ago

As @Gallaecio mentioned: minifying, beautification, compression. Apart from these, I can think of some sort of report generation, perhaps?

drs-11 commented 3 years ago

@Gallaecio, @ejulio I updated the proposal. Another round of feedback, please!

Gallaecio commented 3 years ago

I've been thinking about "post-processing" as a feature. The way it would go: export all items to the target file -> run post-processing components on the file -> overwrite the file with the processed data.

I’m not sure that’s the way it should go. Some post-processing would not need to do that. In fact, you mention compression as something that cannot happen in memory, but I think some (most?) compression algorithms do not need the whole file as input and can compress during data streaming. I’m assuming that Python’s gzip file-like object writes to disk as you write data into it, rather than everything at once when you close the file. I may be wrong, though.
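For reference, the standard library's gzip.GzipFile does allow that kind of streaming: compressed chunks are written to the underlying file as they are produced, so the whole feed never has to sit in memory (a plain local-file illustration, not Scrapy's storage code):

import gzip

with open("items.jl.gz", "wb") as raw, gzip.GzipFile(fileobj=raw, mode="wb") as gz:
    # Each exported line is compressed incrementally as it is written.
    for line in (b'{"id": 1}\n', b'{"id": 2}\n'):
        gz.write(line)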

For cases where an intermediary file is really needed, I would not have the post-processing component overwrite the output file at the end. I think the target file should only be written once, to not complicate things when remote storage is involved (S3, GCS, FTP) or there are options like overwrite to take into account. Instead, I would let the post-processing component handle that on its own: if it needs an intermediary, temporary file, it can create one on its own on the local disk, and then read from it into the ultimate target file at the end.

As @Gallaecio mentioned: minifying, beautification, compression. Apart from these, I can think of some sort of report generation, perhaps?

If it’s a report about items, I think item pipelines are more appropriate for such a feature. And I’m not sure this is something we should provide built-in in Scrapy. Maybe it makes more sense as a Spidermon feature, I don’t know.

If it’s a report about the post-processing instead (e.g. how effective compression was), I’m guessing the specific post-processing implementation could log some minimal stats as INFO, but I would not put much effort on it.


Bhavesh0327 commented 3 years ago

Hi @Gallaecio, I have one last small doubt left for my proposal. I was planning to add the multi-part upload feature for S3 storage. I went through the PRs and issues linked with it, which describe how rewriting the existing patch on top of boto3 would remove the existing errors. Can you please provide a little more input on what you are expecting from this patch?

Gallaecio commented 3 years ago

Can you please provide a little more input on what you are expecting from this patch?

I would summarize it as rewriting S3FeedStorage using boto3’s upload_fileobj method, which automatically uses multi-part support for big files.
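Roughly, the upload side would then look something like this (a sketch of the boto3 call only, not the actual S3FeedStorage rewrite; bucket and key names are placeholders):

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")  # credentials resolved through the usual boto3 chain
# upload_fileobj switches to multi-part uploads automatically above this threshold
config = TransferConfig(multipart_threshold=8 * 1024**2)

with open("items-batch-1.jl", "rb") as feed_file:
    s3.upload_fileobj(feed_file, "my-feeds-bucket", "feeds/items-batch-1.jl", Config=config)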