allow datasets to be connected to raw data files automatically

amzoss commented 4 years ago

That is, create a process where dataset files get stored somewhere and links to those files then get added automatically to the metadata .csv file, possibly by editing/extending the rake tasks to include general asset processing.

amzoss commented 4 years ago

per discussion on 9/3, this isn't a barrier for the submission system. The submission system allows people to specify multiple links to sources for either datasets or visualizations or both. After we receive a submission, we'll have to parse that field manually, download any files, and then figure out a backend system for storing, processing, and associating those asset files with the relevant example pid.

One possibility is to make a new wax task that mimics the image processing wax task. We would then decide on file storage conventions for asset files (e.g., a "raw_data" folder instead of "raw_images") and figure out how to: update the collection metadata with asset info, then update the item detail pages to display one or more assets for download.
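As a rough illustration of the "update the collection metadata with asset info" step, here is a minimal sketch using only Ruby's standard CSV library. The helper name `add_asset_paths`, the `assets` column name, and the semicolon separator are all assumptions made for this example; none of them come from wax_tasks.

```ruby
require 'csv'

# Hypothetical helper (not part of wax_tasks): given the text of a metadata
# CSV and a hash mapping pid => array of asset file paths, return new CSV
# text with an extra column listing each record's assets.
# The column name 'assets' and the ';' separator are assumptions.
def add_asset_paths(metadata_csv, assets_by_pid, column: 'assets')
  table = CSV.parse(metadata_csv, headers: true)
  table.each do |row|
    # Records with no assets get an empty value rather than being dropped.
    paths = assets_by_pid.fetch(row['pid'], [])
    row[column] = paths.join(';')
  end
  table.to_csv
end

metadata = "pid,label\nobj1,First example\nobj2,Second example\n"
assets   = { 'obj1' => ['raw_data/obj1/survey.csv', 'raw_data/obj1/codebook.json'] }
puts add_asset_paths(metadata, assets)
```

The item detail pages could then split that column to render one download link per asset, though how the layouts would consume it is still an open question here.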

cassws commented 3 years ago

Desired functionality for user

User will be able to run:

bundle exec rake wax:collect_datasets datasets

resulting in a subfolder named for each record's pid, placed within a datasets\collected\ or similar folder (equivalent to how the derivatives rake process currently populates img\). Each dataset pid subfolder contains one or more files of recognized dataset types (e.g. .csv, .json, .xls(x)...). Afterwards, the pages wax rake task must add paths to dataset metadata appropriately.
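To make the intended result concrete, here is a minimal sketch of the copy step in plain Ruby. It assumes the raw files for each record already sit in a folder named after its pid; the function name, the argument layout, and the extension list are hypothetical and not taken from wax_tasks.

```ruby
require 'fileutils'

# Assumed list of "recognized" dataset types, for illustration only.
RECOGNIZED_EXTENSIONS = %w[.csv .json .xls .xlsx].freeze

# Hypothetical sketch: for each pid, copy recognized files from
# source_dir/<pid>/ into target_dir/<pid>/ and return a hash of
# pid => array of copied file paths. Pids with no matching files are skipped.
def collect_datasets(source_dir, target_dir, pids, extensions: RECOGNIZED_EXTENSIONS)
  pids.each_with_object({}) do |pid, collected|
    candidates = Dir.glob(File.join(source_dir, pid, '*')).sort
    files = candidates.select do |path|
      File.file?(path) && extensions.include?(File.extname(path).downcase)
    end
    next if files.empty?

    dest = File.join(target_dir, pid)
    FileUtils.mkdir_p(dest)
    FileUtils.cp(files, dest)
    collected[pid] = files.map { |f| File.join(dest, File.basename(f)) }
  end
end
```

Calling something like `collect_datasets('_data/raw_data', 'datasets/collected', %w[obj1 obj2])` would return the copied paths per pid, which is the information a later metadata step would need. Both folder names in that call are placeholders.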

Necessary changes in wax_tasks

[ ] Create new collect_datasets.rake script, mimicking derivatives_simple.rake to start. This script handles the rake process at a high level and is initiated via bundle exec rake wax:collect_datasets datasets
[ ] In lib\wax_tasks\site.rb, create a new collect_datasets function, mimicking generate_derivatives. This function finds the relevant collection, handles errors, and calls two collection methods: .write_datasets (new, written in the next step) and .update_metadata (should still function correctly I thiiink!)
[ ] In lib\wax_tasks\collection\, create a new datasets.rb script. This is where write_datasets will live, similar to write_simple_derivatives in images.rb, and items_from_datasets, similar to items_from_imagedata.
[ ] Write items_from_datasets function
[ ] Write write_datasets function
[ ] Test! Ensure original user requirements were met, adjust as needed
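For the items_from_datasets step, one possible shape is sketched below in plain Ruby. It is written from the description in the checklist rather than from the wax_tasks source, so the `DatasetItem` struct and the rule that a flat file's basename or a folder's name becomes the pid are assumptions for illustration.

```ruby
# Hypothetical value object: one record (pid) and the raw files that belong to it.
DatasetItem = Struct.new(:pid, :files)

# Sketch of an items_from_datasets-style scan of a raw data folder.
# Assumed conventions: a flat file "obj1.csv" is a single-file item with
# pid "obj1"; a folder "obj2/" is a multi-file item with pid "obj2".
# Hidden files and empty folders are ignored. A pid that appears both as a
# file and as a folder is not reconciled here.
def items_from_datasets(raw_dir)
  Dir.children(raw_dir).sort.filter_map do |entry|
    next if entry.start_with?('.')

    path = File.join(raw_dir, entry)
    if File.directory?(path)
      files = Dir.children(path).sort
                 .reject { |f| f.start_with?('.') }
                 .map { |f| File.join(path, f) }
                 .select { |f| File.file?(f) }
      DatasetItem.new(entry, files) unless files.empty?
    elsif File.file?(path)
      DatasetItem.new(File.basename(entry, '.*'), [path])
    end
  end
end
```

A write_datasets function could then iterate over these items and copy each item's files into its pid subfolder. This sketch requires Ruby 2.7 or later because of `filter_map`.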

(sidenote for future -- would we ever consider a fetch raw data wax task for datasets and/or datavis? or assume that should always involve some human curation?)

cassws commented 3 years ago

@amzoss, when you have a moment, two quick Q's for you:

Do you have any initial feedback on the approach above, in case I'm missing anything or anything is out of scope?
Would you be able to initialize a fork of minicomp\wax_tasks within the visualizingthefuture account? I don't have organization-level privileges for VtF, but after doing some basic proof-of-concept playing in my own account I now realize it's much easier to collaborate over there.

Thank you!

amzoss commented 3 years ago

Hi @zoews ! I haven't looked into the rake tasks, so I am taking your word on the changes in wax_tasks. The first paragraph, however, says that it is the pages wax task that edits the metadata to add paths, and I think that's incorrect. I think that currently happens in derivatives, so it should be sufficient to generate a new task that both creates a directory for dataset files and writes the paths back into the metadata.

On your sidenote for future, should that just get added as an issue, and we can evaluate it sometime?

Happy to fork wax_tasks, will try to achieve that shortly.

Thanks so much!

amzoss commented 3 years ago

Another thought: maybe have the task name agnostic to the type of file? If we expand the repository to different types of non-image-based collections (e.g., 3D models, videos, course modules, etc.) we might use this same task, since it doesn't presume visual content. Also, maybe for that reason we shouldn't have a list of recognized file extensions?

amzoss commented 3 years ago

@zoews Fork ready: https://github.com/visualizingthefuture/wax_tasks

cassws commented 3 years ago

Thanks @amzoss !! Hugely helpful on all fronts. I think solving this problem agnostic of file/asset type is a very strong approach -- and thanks for making that fork! I will carve out time for this all and report back.

amzoss commented 3 years ago

Looks like postwax is another example of people adding rake tasks to a Wax workflow, in case it's of any interest! https://github.com/pbinkley/postwax

visualizingthefuture / examples-repository

allow datasets to be connected to raw data files automatically #65
