A further question to think about is how we want to run the pipeline. I was originally imagining submitting jobs to europa for each notebook, parametrized by date range and node name. @mghpcsim also suggested potentially running the notebooks via GitHub as part of the workflow for rendering the site.
@davidlary @lakithaomal @mghpcsim (if you have any ideas, please add them below)
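As a sketch of the europa option, assuming we use papermill to parametrize the notebooks (papermill itself is an assumption here; any parametrized-notebook runner would work), each job could execute a template notebook per node and date range:

```python
# Hypothetical sketch: run one analysis notebook per (node, date range)
# using papermill (assumed tool; not confirmed in this thread).
import papermill as pm
from datetime import date

# Hypothetical parameters -- actual node names and dates TBD.
nodes = ["central_node_8"]
start, end = date(2023, 1, 1), date(2023, 1, 31)

for node in nodes:
    pm.execute_notebook(
        "daily_analysis.ipynb",                # template notebook (hypothetical name)
        f"output/{node}_{start}_{end}.ipynb",  # rendered result per node/date range
        parameters={
            "node_name": node,
            "start_date": start.isoformat(),
            "end_date": end.isoformat(),
        },
    )
```

The same script could run under a GitHub Actions job for the site-rendering option, so the two approaches wouldn't have to diverge.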
We need to figure out an appropriate way to access the data for our daily analyses. For now, I have copied some historical data for central node 8 to OSN so we can develop our analysis notebooks, but we will need a long-term solution for accessing the new data as it becomes available. Per our previous discussions, I think the current idea is to:
- rclone the new data from mfs to OSN so that it is easily accessible anywhere

We may want to do an rclone of all the historical data anyway so we can have it on OSN and get a sense of the current data volume (this would probably also be helpful for the AWS efforts). If the total size so far is only around 8 TB, my current allocation should be sufficient, but we may need to request an increase in space.
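As a minimal sketch of that transfer, assuming rclone remotes named `mfs` and `osn` have already been configured in rclone.conf (the remote names and paths below are hypothetical):

```python
# Minimal sketch: sync historical CSVs to OSN and check total volume.
# Remote names "mfs" and "osn" and the bucket path are assumptions --
# they must match whatever is actually configured in rclone.conf.
import subprocess

SRC = "mfs:/data/raw"          # hypothetical source path on the file server
DST = "osn:mints-bucket/raw"   # hypothetical OSN bucket/prefix

# Report total size first so we know whether the allocation suffices.
subprocess.run(["rclone", "size", SRC], check=True)

# Use copy (not sync) so nothing already on OSN is ever deleted; CSVs only.
subprocess.run(
    ["rclone", "copy", SRC, DST, "--include", "*.csv", "--progress"],
    check=True,
)
```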
@lakithaomal, @davidlary If we do end up rcloning the CSVs to OSN, I suggest we take the opportunity to clean up the file naming conventions. Currently, the CSV files are partitioned by device MAC address rather than device name. I think this makes browsing the data harder than it needs to be, since what we really want is all of the sensors that sit together on each Node.
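As a rough sketch of that reorganization, assuming the current layout keeps CSVs under per-MAC directories (the MAC-to-node mapping, example MAC, and directory layout below are all hypothetical; the real mapping would come from the node inventory):

```python
# Hypothetical sketch: re-key CSV paths from MAC address to node name
# before (or while) copying to OSN.
from pathlib import Path
import shutil

MAC_TO_NODE = {
    "00:1a:2b:3c:4d:5e": "central_node_8",  # example entry only
}

src_root = Path("data/by_mac")   # assumed current layout: <mac>/<sensor>/<date>.csv
dst_root = Path("data/by_node")  # desired layout: <node>/<sensor>/<date>.csv

for csv_path in src_root.rglob("*.csv"):
    rel = csv_path.relative_to(src_root)
    mac = rel.parts[0]
    node = MAC_TO_NODE.get(mac)
    if node is None:
        print(f"unmapped MAC, skipping: {mac}")
        continue
    dst = dst_root / node / Path(*rel.parts[1:])
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(csv_path, dst)
```

Doing the re-keying as part of the one-time transfer would mean we never have to maintain two naming schemes on OSN.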