probberechts / soccerdata

⛏⚽ Scrape soccer data from Club Elo, ESPN, FBref, FiveThirtyEight, Football-Data.co.uk, FotMob, Sofascore, SoFIFA, Understat and WhoScored.
https://soccerdata.readthedocs.io/en/latest/
Other
544 stars 95 forks source link

[WhoScored] Ignore cached events file if empty #420

Closed shufinskiy closed 8 months ago

shufinskiy commented 9 months ago

Hello, @probberechts.

I propose a solution to the problem of empty files in the cache for Whoscored.

In issue 98 you suggest delete empty file with bash command by file size.

I made method _size_file which does same with Path.stat().st_size. If the file is smaller than threshold, we believe that it is not cached

    def _size_file(
        self,
        filepath: Optional[Path] = None,
        filter_size: int = 60
    ) -> bool:
        """Check if `filepath` contains data valid size.
        Parameters
        ----------
        filepath : Path, optional
            Path where file should be cached. If None, return False.
        filter_size : int file size threshold. If file is smaller, return False
        Raises
        ------
        TypeError
            If filter_size is not an integer.
        Returns
        -------
        bool
            True in case of a cache hit, otherwise False.
        """
        if filepath is None:
            return False
        if not isinstance(filter_size, int):
            raise TypeError("filter_size must be of type int")
        try:
            file_size = filepath.stat().st_size
        except FileNotFoundError:
            return False
        return file_size > filter_size and filepath.exists()
probberechts commented 8 months ago

The problem only exists for the WhoScored scraper but your solution affects all scrapers. For some scrapers, an empty reply might actually be a valid answer. For example, if a new team is promoted, an empty result is expected in the ClubElo scraper.

Moreover, the bash script checks the file size simply because that was easy to write as a bash command. What it really should check is whether the file contains an empty JSON object. Something that could easily be done in Python.

shufinskiy commented 8 months ago

Yes, you're right. I'll think about how it can be implemented in a different way.

shufinskiy commented 8 months ago

@probberechts I fixed the verification logic: now inside the Whoscored.read_events method there is a check of the first 4 bytes of the file: if they are null, then the get method is run again with the no_cache=True parameter.

reader = self.get(
    url,
    filepath,
    var="requirejs.s.contexts._.config.config.params.args.matchCentreData",
    no_cache=live,
)
if reader.read(4) == b'null':
    reader = self.get(
        url,
        filepath,
        var="requirejs.s.contexts._.config.config.params.args.matchCentreData",
        no_cache=True,
    )
reader.seek(0)
json_data = json.load(reader)
probberechts commented 8 months ago

Nice solution! Thanks.