Bsweger/move code from infrastructure repo

bsweger commented 7 months ago

Resolves Infectious-Disease-Modeling-Hubs/hubverse-cloud#68

This somewhat sizable PR ports some MVP test code to a stand-alone repo. There's some additional context below, and I'll annotate inline.

For this initial review, it would useful to have someone:

I'm less concerned about the overall design of the class (frankly, there's a lot I don't love about it). However, it's time to get this code into its own repo so we can iterate on it properly.

Additional info

This PR creates the first iteration of a Python function that:

reads a single model-output file from S3 (for now, in parquet or .csv format)
adds three columns: round_id, team, and model
writes the updated information back to S3 in parquet format

This code was part of the "proof of concept" for using S3 event notifications as a way to trigger a Lambda function that processes model-output files being sent to S3. It's written in Python because AWS Lambda doesn't support an R runtime out of the box.

The diagram below shows how the "Lambda function triggered by a new S3 object" builds on top of the S3 sync already in place:

annakrystalli commented 7 months ago

Hello @bsweger ! I took a quick pass just reading through the PR here on GitHub (not installed or played with anything locally) and added two very minor comments.

Overall all looks generally sound to me. My one bigger question at this point is, can we guarantee that the function is reading and encoding NAs correctly? I know this can be an issue when using python to read in csv with values encoded as NAs in R so might be a good idea to set up a test that includes a model output csv file with e.g. a mean output type id and check that when the file is accessed again in the parquet format, the output_type_id value is still NA.

bsweger commented 7 months ago

My one bigger question at this point is, can we guarantee that the function is reading and encoding NAs correctly? I know this can be an issue when using python to read in csv with values encoded as NAs in R so might be a good idea to set up a test that includes a model output csv file with e.g. a mean output type id and check that when the file is accessed again in the parquet format, the output_type_id value is still NA.

Heya @annakrystalli, that's a good call, and I just pushed a commit to make the test .csv/.parquet files represent a wider-array of potential model-ouput data (e.g., mean output_type with an empty output_type_id)

Your point about R's NA is better suited to an integration test, which I'm working on and can push later tonight. Preliminary testing looks promising (pyarrow seems smart enough to read in the NA and render it in Python as null)

As for writing the data back out to parquet...is NA R-specific? Asking because if we're standardizing on parquet as our model-output standard, I think we should use whatever the parquet standard is for representing null/na

annakrystalli commented 7 months ago

My understanding is that indeed Parquet files have their own way of encoding missing values that is language agnostic so all good there, as you suggest. NAs are indeed R specific so if all has worked well in the transformation, when connecting to the transformed hub and collecting in R, the original NA values should still be interpreted as NA. So indeed sounds more like an integration test and good to hear results so far look promising!

bsweger commented 7 months ago

@annakrystalli this commit adds integration tests (it doesn't integrate directly with hubData, though that would be a good future improvement, if you want to add an issue)

Writing this revealed that we could indeed be doing more to normalize missing data when reading .csvs (you'll see that change in the commit linked above). Really appreciate the question!

annakrystalli commented 7 months ago

Glad it was useful!

I will go ahead and approve the PR as my particular concern has been addressed but noting again that I haven't addressed the specific items you asked for feedback on so probably still a good idea to get @lshandross's input too?

In the meantime I'll open an issue re: integration test with hubData.

bsweger commented 7 months ago

Noting that @lshandross found some setup issues that we should address (or document). However, we agreed to come back to those once they've had a resolution with IT about some specific problems.

Thanks for taking on the challenge of working through the README @lshandross. The goal is to minimize setup headaches, and ideally get to a place where someone without admin access to their machine can get the project set up.

hubverse-org / hubverse-transform

Bsweger/move code from infrastructure repo #1