rstudio / pins-python

https://rstudio.github.io/pins-python/
MIT License
50 stars 12 forks source link

Add docs about unsuitability of high-frequency writes #258

Closed nathanjmcdougall closed 2 months ago

nathanjmcdougall commented 2 months ago

Currently running the same write repeatedly one-after-the-other raises an error:

board.pin_write(df, "cool_pin", type="csv")
board.pin_write(df, "cool_pin", type="csv")

pins.errors.PinsError: Attempting to write pin version to [...]/20240720T080112Z-320a1, but that directory already exists.

Presumably this is just because both writes occurred in the same second, so they get the same filename. I expect this sort of thing could come up when people are using pins in a pipeline that unexpectedly runs really quickly for whatever reason. Maybe that's an abuse of pins 🤷

I reckon the individual board object can keep track of when the most recent write was and impose a within-object rate limiting to cope with this. In theory, you could have a separate rate limit for each pin but that's just added complexity.

All that's needed is a new class variable which stores the most recent write time, a conditional check and a time.sleep call, probably also a call to inform since this is an unusual case.

With all that said, my main motivation for this is just to make testing easier - but I think it also has the benefit of avoiding a somewhat cryptic error message and adding some robustness.

juliasilge commented 2 months ago

In the R package, we handle this by mocking the version name in many tests. For example, here:

https://github.com/rstudio/pins-r/blob/de47c3141d97020e9f4a6a2e93d5823bc649105b/R/testthat.R#L133

I don't think pins itself needs to be able to support writing again really fast, again and again. Overall, it's not a good tool for that kind of work anyway and it hasn't ever come up as a real use case from folks.

nathanjmcdougall commented 2 months ago

That sounds reasonable. I think then I'll look to add some documentation about this limitation in this section: https://rstudio.github.io/pins-python/get_started.html#how-and-what-to-store-as-a-pin It currently gives advice against using pins for concurrent write from multiple processes, and I think this is a similar case that can just be added explicitly.

For context, I am working on #241, and I'm trying to create the equivalent of this test: https://github.com/rstudio/pins-r/blob/de47c3141d97020e9f4a6a2e93d5823bc649105b/R/testthat.R#L165-L173 I'm interested how this test passes in R without dedicated rate-limiting or mocking. I confess that I'm not proficient enough in R to be familiar with the way that testthat::local_mocked_bindings works in the case you've linked.

In Python, I could monkey-patch the board after the initial write, to modify the name of a pin version after the fact. That seems quite unpleasant within the test itself, but it might be tolerable in a few tests. Another option is to temporarily mock the datetime (!) during the initial write.

nathanjmcdougall commented 2 months ago

Aha, the time.sleep approach is already being used in the test suite:

https://github.com/rstudio/pins-python/blob/70380b4532f3778e4777888507a49c2d1cab1ae8/pins/tests/test_boards.py#L351-L355

Although in other places the Python-only created argument is being used (which is what I will use I think). https://github.com/rstudio/pins-python/blob/70380b4532f3778e4777888507a49c2d1cab1ae8/pins/tests/test_boards.py#L363-L371