amzoss opened 4 years ago
Per discussion on 9/3, this isn't a barrier for the submission system. The submission system allows people to specify multiple links to sources for datasets, visualizations, or both. After we receive a submission, we'll have to parse that field manually, download any files, and then figure out a backend system for storing, processing, and associating those asset files with the relevant example pid.
One possibility is to make a new wax task that mimics the image processing wax task. We would then decide on file storage conventions for asset files (e.g., a "raw_data" folder instead of "raw_images") and figure out how to: update the collection metadata with asset info, then update the item detail pages to display one or more assets for download.
The user will be able to run:

`bundle exec rake wax:collect_datasets datasets`

resulting in a subfolder named for each record's pid, placed within a `datasets/collected/` (or similar) folder, the equivalent of how the derivatives rake process currently populates `img/`. Each dataset pid subfolder contains one or more files of recognized dataset types (e.g. `.csv`, `.json`, `.xls(x)`, ...). Afterwards, the `pages` wax rake task must add paths to dataset metadata appropriately.
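For concreteness, the pid-to-subfolder convention described above could look like this tiny Ruby helper (the `collected_dir_for` name and the default root are placeholders of mine, not an agreed convention):

```ruby
# Hypothetical helper illustrating the folder convention: one subfolder
# per record pid under datasets/collected/. All names are placeholders.
def collected_dir_for(pid, root = File.join("datasets", "collected"))
  File.join(root, pid)
end

collected_dir_for("record_001")  # => "datasets/collected/record_001"
```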
Proposed steps:

1. Create a `collect_datasets.rake` script, mimicking `derivatives_simple.rake` to start. This script handles the rake process at a high level and is initiated via `bundle exec rake wax:collect_datasets datasets`.
2. In `lib/wax_tasks/site.rb`, create a new `collect_datasets` function, mimicking `generate_derivatives`. This function finds the relevant collection, handles errors, and calls two collection methods: `.write_datasets` (new, written in the next step) and `.update_metadata` (which should still function correctly, I thiiink!).
3. In `lib/wax_tasks/collection/`, create a new `datasets.rb` script. This is where `write_datasets` will live (similar to `write_simple_derivatives` in `images.rb`), along with `items_from_datasets` (similar to `items_from_imagedata`).
   - `items_from_datasets` function
   - `write_datasets` function

(Sidenote for the future: would we ever consider a "fetch raw data" wax task for datasets and/or datavis? Or assume that should always involve some human curation?)
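For what it's worth, the two new collection methods sketched above might hang together roughly like this. This is purely illustrative: it assumes items are plain hashes keyed by `"pid"` and that raw files sit in per-pid source folders; the real wax_tasks Collection API differs.

```ruby
require "fileutils"

# Sketch of write_datasets: copy each record's raw files into a per-pid
# subfolder under collected_dir and return a pid => paths mapping.
# Hypothetical signature; the real wax_tasks Collection methods differ.
def write_datasets(pids, source_dir, collected_dir)
  pids.each_with_object({}) do |pid, paths|
    pid_dir = File.join(collected_dir, pid)
    FileUtils.mkdir_p(pid_dir)
    paths[pid] = Dir.glob(File.join(source_dir, pid, "*")).sort.map do |file|
      target = File.join(pid_dir, File.basename(file))
      FileUtils.cp(file, target)
      target
    end
  end
end

# Sketch of items_from_datasets: merge collected paths into item metadata,
# analogous to how items_from_imagedata associates image derivatives.
def items_from_datasets(items, dataset_paths)
  items.map do |item|
    item.merge("datasets" => dataset_paths.fetch(item["pid"], []))
  end
end
```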
@amzoss, when you have a moment, two quick Q's for you: could you fork `minicomp/wax_tasks` within the visualizingthefuture account? I don't have organization-level privileges for VtF, but after doing some basic proof-of-concept playing in my own account I now realize it's much easier to collaborate over there.

Thank you!
Hi @zoews! I haven't looked into the rake tasks, so I am taking your word on the changes in wax_tasks. The first paragraph, however, says that it is the `pages` wax task that edits the metadata to add paths, and I think that's incorrect. I think that currently happens in `derivatives`, so it should be sufficient to generate a new task that both creates a directory for dataset files and writes the paths back into the metadata.
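To illustrate the write-the-paths-back idea, a new task could update the collection's metadata CSV along these lines. This is only a sketch: the `add_asset_paths` helper and the `dataset_paths` column name are hypothetical, not an existing wax convention.

```ruby
require "csv"

# Sketch: write collected asset paths back into the metadata CSV, as the
# derivatives task does for images. "dataset_paths" is a placeholder column.
def add_asset_paths(csv_path, paths_by_pid)
  rows = CSV.read(csv_path, headers: true)
  rows.each do |row|
    row["dataset_paths"] = Array(paths_by_pid[row["pid"]]).join(";")
  end
  headers = rows.headers | ["dataset_paths"]
  CSV.open(csv_path, "w") do |csv|
    csv << headers
    rows.each { |row| csv << headers.map { |h| row[h] } }
  end
end
```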
On your sidenote for future, should that just get added as an issue, and we can evaluate it sometime?
Happy to fork wax_tasks, will try to achieve that shortly.
Thanks so much!
Another thought: maybe make the task name agnostic to the type of file? If we expand the repository to different types of non-image-based collections (e.g., 3D models, videos, course modules, etc.), we might use this same task, since it doesn't presume visual content. Also, maybe for that reason we shouldn't have a list of recognized file extensions?
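A file-type-agnostic version could simply collect whatever files sit in a record's asset folder, with no extension allowlist at all. A minimal sketch, assuming per-pid source folders; `collect_assets` is a placeholder name:

```ruby
require "fileutils"

# Type-agnostic variant: copy every regular file in a record's asset folder,
# regardless of extension, so the same task serves datasets, 3D models,
# videos, etc. Hypothetical helper, not part of wax_tasks.
def collect_assets(pid, source_dir, collected_dir)
  pid_dir = File.join(collected_dir, pid)
  FileUtils.mkdir_p(pid_dir)
  Dir.glob(File.join(source_dir, pid, "*"))
     .select { |f| File.file?(f) }
     .each { |f| FileUtils.cp(f, pid_dir) }
  Dir.children(pid_dir).sort
end
```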
@zoews Fork ready: https://github.com/visualizingthefuture/wax_tasks
Thanks @amzoss!! Hugely helpful on all fronts. I think solving this problem agnostic of file/asset type is a very strong approach -- and thanks for making that fork! I will carve out time for this all and report back.
Looks like postwax is another example of people adding rake tasks to a Wax workflow, in case it's of any interest! https://github.com/pbinkley/postwax
That is, create a process where dataset files get stored somewhere and links to those files then automatically get added to the metadata .csv file, possibly by editing/extending the rake tasks to include general asset processing.