openclimatefix / nowcasting_forecast

Making live forecasts for the nowcasting project

[Meta] Get Power Perceiver model into production #119

Open · peterdudfield opened this issue 2 years ago

peterdudfield commented 2 years ago

Here's the doc which describes what still needs to be done

TODO:

JackKelly commented 2 years ago

tbh, given the time pressure, and how fiddly the data is, I'm going to prioritise getting the model working on "real" data first, if that's OK?

peterdudfield commented 2 years ago

> tbh, given the time pressure, and how fiddly the data is, I'm going to prioritise getting the model working on "real" data first, if that's OK?

As in, get the batch from S3 and run the model on that?

peterdudfield commented 2 years ago

Hey @jacobbieker

I'm just looking at the PP code: https://github.com/openclimatefix/power_perceiver/blob/main/power_perceiver/experiments/exp_027_longer_forecasts.py

I'm using this blog as useful guidance

I'm a bit worried about the code going into production.

I think we've got two options:

  1. Copy the code out into this repo (nowcasting_forecast)
  2. Keep the code in that repo, but move it to a new folder and add GitHub Actions to the repo

Either way, we need to

What do you think?

jacobbieker commented 2 years ago

I'd prefer keeping it in power_perceiver, so we get more used to being able to swap out models by importing them rather than copying them in every time. I agree on refactoring, but I think we can keep the different parts of the full model together in a new folder and just break it into the full model + component parts, the dataloader, and training.

I agree on unit tests. I'd just go with one end-to-end test of loading the batch at the moment, just so we can get the model into production a bit quicker, and then add more unit tests for the individual components.
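A rough sketch of what importing the model from power_perceiver could look like on the nowcasting_forecast side (the module path, class name, and Lightning-style checkpoint loading here are placeholders, not the real API):

```python
# Hypothetical import: the real module path and class name depend on the refactor.
from power_perceiver.production import FullModel  # placeholder location


def load_model(checkpoint_path: str) -> FullModel:
    """Load a trained Power Perceiver model for inference."""
    # Assumes the model is a PyTorch Lightning module; adjust if it is a plain nn.Module.
    model = FullModel.load_from_checkpoint(checkpoint_path)
    model.eval()
    return model
```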

peterdudfield commented 2 years ago

> I'd prefer keeping it in power_perceiver, so we get more used to being able to swap out models by importing them rather than copying them in every time. I agree on refactoring, but I think we can keep the different parts of the full model together in a new folder and just break it into the full model + component parts, the dataloader, and training.

> I agree on unit tests. I'd just go with one end-to-end test of loading the batch at the moment, just so we can get the model into production a bit quicker, and then add more unit tests for the individual components.

That sounds good, so roughly it'll be this:

Optional later

What do you think about a pydantic model for the data / dict that goes into the model? I'm happy to tackle this if you want - https://github.com/openclimatefix/power_perceiver/issues/188
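For illustration, a minimal sketch of what such a pydantic model could look like (the field names, shapes, and float32 check are assumptions, not the schema proposed in that issue):

```python
import numpy as np
from pydantic import BaseModel, Field, validator


class PPBatch(BaseModel):
    """Hypothetical schema for the dict of arrays that goes into the Power Perceiver model."""

    class Config:
        arbitrary_types_allowed = True  # allow numpy arrays as field values

    # Field names and shapes are placeholders for illustration only.
    satellite: np.ndarray = Field(..., description="Satellite imagery, e.g. (batch, time, y, x, channels)")
    nwp: np.ndarray = Field(..., description="Numerical weather prediction inputs")
    gsp: np.ndarray = Field(..., description="GSP PV power time series")

    @validator("satellite", "nwp", "gsp")
    def must_be_float32(cls, v):
        # Guard against accidental float64 arrays, which double memory use.
        if v.dtype != np.float32:
            raise ValueError(f"expected float32, got {v.dtype}")
        return v
```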

jacobbieker commented 2 years ago

Yeah, a pydantic model for the data would be great; if you could tackle that, as you have more experience with it, that would be awesome! I've already started the refactor locally for the production model, and I'll push it and add the GH Actions / basic test tomorrow.

peterdudfield commented 2 years ago

I'm doing the GitHub Actions at the moment, so hopefully tomorrow you can just push the test.

peterdudfield commented 1 year ago

Current status is:

memory issue

Potential solution

Would be interested in your thoughts @JackKelly (and @jacobbieker but I know you are on holiday at the moment)

peterdudfield commented 1 year ago

For batch size 2, running on a CPU

          RuntimeError: [enforce fail at alloc_cpu.cpp:73] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 50324972800 bytes. Error code 12 (Cannot allocate memory)

50324972800 bytes is about 50 GB.
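A quick check of that number (simple arithmetic, not taken from the logs):

```python
bytes_requested = 50_324_972_800
print(f"{bytes_requested / 1e9:.1f} GB")     # ~50.3 GB (decimal gigabytes)
print(f"{bytes_requested / 2**30:.1f} GiB")  # ~46.9 GiB (binary gibibytes)
```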

peterdudfield commented 1 year ago

Idea: make sure floats are float32, not float64.
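A minimal sketch of what that could look like, assuming the batch is a dict of PyTorch tensors (the function and key handling are illustrative, not the actual production code):

```python
import torch


def cast_batch_to_float32(batch: dict) -> dict:
    """Downcast any floating-point tensors (e.g. float64) to float32, roughly halving memory use."""
    return {
        key: value.to(torch.float32) if torch.is_tensor(value) and value.is_floating_point() else value
        for key, value in batch.items()
    }
```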

peterdudfield commented 1 year ago

When reducing to float32,

Also ran for batch size 4, and got this error:

          RuntimeError: [enforce fail at alloc_cpu.cpp:73] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 100649945600 bytes. Error code 12 (Cannot allocate memory)

Which is double the previous amount of memory, so at least batch sizing does scale as expected.

peterdudfield commented 1 year ago

I managed to run it on a GPU; batch size 4 ran into a memory issue on a GPU with 16 GB.

peterdudfield commented 1 year ago

Current status,