Open jowagner opened 3 years ago
@tlynn747 Do you see any problem with listing the filenames as they are on Google Drive somewhere in this repository so that we know exactly which files were used and ignored from Google Drive? This would be helpful for reproducibility in future runs. There are 2 ways to log this information:
- uploading gdrive_filelist.csv to somewhere in the repository (as the issue suggests)
- logging filenames used by scripts/text_processor.py
I'd agree Option 1 would be the simplest and preferable approach for reproducibility and maintenance.
(Aside: When gdrive_filelist.csv was created, the thinking was to store it with the data files on Google Drive, as that seemed a convenient place for editing the file (no need for code checkouts etc.). Out of caution, we also assumed the file names could be sensitive, so they were not included in or with the code, which may become publicly available on GitHub.)
Option 2 would require (for reproducing results) manually comparing approximately 160 filenames in gdrive_filelist.csv against the log, which would be time-consuming and error-prone, and we would also have to ensure the logs don't end up in the repo. Actually, if the log output were a CSV file formatted like gdrive_filelist.csv, we could avoid some of that hassle. But the log files would still need to be managed centrally (if local, it's not really reproducible) and somehow linked with each model, while being kept private.
Option 3 - Obfuscate the filenames. If publishing the filenames is an issue, we could change them to something random and/or meaningless which could appear in the repo. We'd then have to map these to the actual filenames, e.g. in gdrive_filelist.csv. Not as good as Option 1, but an alternative to Option 2.
TLDR for @tlynn747: All after the initial question is a discussion of what to do if you tell us the answer is that we must keep the list secret.
With Option 2, even if the log contains a 1:1 copy of gdrive_filelist.csv, you'd still have a hard time figuring out what the specific selection of files means, i.e. what criteria led to the selection. A possibility would be to add a file known_gdrive_filelists.csv to the repo that has two columns, "sha256sum of gdrive_filelist" and "description", and to print the sha256sum to the log. Our tools could print a reminder to update this file when an unknown filelist is used.
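A minimal sketch of that idea, assuming known_gdrive_filelists.csv has a header row with the column names sha256sum and description (hypothetical names, as are all the function names below). It hashes the raw bytes of the filelist, looks the digest up, and builds the log line, including the reminder when the filelist is unknown:

```python
import csv
import hashlib
import io


def sha256_of_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of the raw filelist contents."""
    return hashlib.sha256(data).hexdigest()


def load_known_filelists(csv_text: str) -> dict:
    """Parse known_gdrive_filelists.csv text into {sha256sum: description}.

    Assumes a header row with the columns 'sha256sum' and 'description'.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["sha256sum"]: row["description"] for row in reader}


def describe_filelist(filelist_bytes: bytes, known: dict) -> tuple:
    """Return (digest, log_message) for the filelist that was used."""
    digest = sha256_of_bytes(filelist_bytes)
    description = known.get(digest)
    if description is None:
        # Unknown filelist: remind the user to document it in the repo.
        message = (
            f"gdrive_filelist sha256sum={digest} (unknown filelist: "
            "please add a row to known_gdrive_filelists.csv)"
        )
    else:
        message = f"gdrive_filelist sha256sum={digest} ({description})"
    return digest, message
```

In a tool such as scripts/text_processor.py, the filelist would be read in binary mode and the returned message written to the log. Only the digest and the description end up in the repository, not the filenames.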
Option 3: Working with randomised filenames would be hard for maintaining the files on gdrive and for checking things. What might work is to use IDs and split the filelist into two parts: (1) a list maintained on gdrive that maps filenames to IDs, never changing existing rows, (2) a list in the source code repo that maps IDs to inclusion status. To make it easy to work with these, they should be aligned line-by-line all the time. Alternatively, one could work on the output of the join command.
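As a rough sketch of the two-list variant, the following joins the private filename-to-ID list with the public ID-to-status list in Python rather than with the join command. The column names (id, filename, include) and the "yes"/"no" status values are assumptions for illustration, not the actual layout of gdrive_filelist.csv:

```python
import csv
import io


def read_rows(csv_text: str) -> list:
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def join_on_id(mapping_csv: str, status_csv: str) -> list:
    """Join the private (id, filename) list with the public (id, include) list.

    Returns (filename, included) pairs in the order of the private list.
    Raises KeyError if an ID has no inclusion status in the public list.
    """
    status_by_id = {row["id"]: row["include"] for row in read_rows(status_csv)}
    joined = []
    for row in read_rows(mapping_csv):
        file_id = row["id"]
        if file_id not in status_by_id:
            raise KeyError(f"no inclusion status for ID {file_id}")
        joined.append((row["filename"], status_by_id[file_id] == "yes"))
    return joined
```

Because the join is keyed on the ID, this version does not depend on the two lists being aligned line-by-line, although keeping them aligned would still make manual checks easier.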
Both options 1 and 3 require us to write good commit messages and commit frequently so that there is useful information on any change of the filelist in the repo.
Yes, I agree Options 1 and 3 are better. Assuming Option 1 doesn't go ahead and using Joachim's description of Option 3 as an example, it could look something like this:
1) Store a mapping from filenames to IDs on Google Drive:
{"filename": "sha256sum"}
2) Store a filelist which maps IDs to inclusion status (somewhere in this repository):
{"sha256sum": True/False}
Using a hash function for Option 3 is not good, as it allows people to test whether a given candidate path is in the list. An attacker can test many millions of candidates, and guessing becomes a lot easier with each piece of information that on its own looks uncritical, e.g. the name of the top-level folder.
In contrast, a hash of the overall .csv file for Option 2 is fine, as it is not feasible to guess the full list of paths.
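To illustrate the concern, here is a small sketch of such a guessing test, assuming the published IDs are unsalted SHA-256 hashes of the paths. The function names and any paths passed in are hypothetical:

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Hex SHA-256 digest of a UTF-8 encoded string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_guessable(published_hashes: set, candidates: list) -> list:
    """Return the candidate paths whose hash appears in the published set.

    Each candidate costs one hash computation, so an attacker who knows,
    for example, the top-level folder name can try a very large number
    of plausible paths.
    """
    return [path for path in candidates if sha256_hex(path) in published_hashes]
```

Any candidate returned by find_guessable is confirmed to be in the list, which is exactly the information the obfuscation was meant to hide. IDs that are not derived from the filenames, such as sequential numbers, do not allow this test.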
It would be handy for issue #35 to have gdrive_filelist.csv in this repo rather than in cloud storage. Can the list of filenames be published or is the list a secret?