jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Can we move gdrive_filelist.csv to the repo? #43

Open jowagner opened 3 years ago

jowagner commented 3 years ago

It would be handy for issue #35 to have gdrive_filelist.csv in this repo rather than in cloud storage. Can the list of filenames be published or is the list a secret?

jbrry commented 3 years ago

@tlynn747 Do you see any problem with listing the filenames as they are on Google Drive somewhere in this repository so that we know exactly which files were used and ignored from Google Drive? This would be helpful for reproducibility in future runs. There are 2 ways to log this information:

  1. uploading gdrive_filelist.csv to somewhere in the repository (as the issue suggests)
  2. logging filenames used by scripts/text_processor.py

alanagiasi commented 3 years ago

> @tlynn747 Do you see any problem with listing the filenames as they are on Google Drive somewhere in this repository so that we know exactly which files were used and ignored from Google Drive? This would be helpful for reproducibility in future runs. There are 2 ways to log this information:
>
>   1. uploading gdrive_filelist.csv to somewhere in the repository (as the issue suggests)
>   2. logging filenames used by scripts/text_processor.py

I'd agree Option 1 would be the simplest and preferable approach for reproducibility and maintenance. (Aside: the thinking when gdrive_filelist.csv was created was to store it with the data files on Google Drive, as that seemed a convenient place for editing the file (no need for code checkouts etc.), and, out of caution, assuming the filenames could be sensitive, to keep them out of the code, which may become publicly available on GitHub.)

Option 2 would require (for reproducing results) manually comparing approximately 160 filenames in gdrive_filelist.csv against the log, which would be time consuming and error prone, and we would also have to ensure the logs don't end up in the repo. If the log output were a CSV file formatted like gdrive_filelist.csv we could avoid some of that hassle. But the log files would still need to be managed centrally (if they are only local, the results are not really reproducible) and somehow linked with each model, while being kept private.
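A CSV-formatted log could be checked against gdrive_filelist.csv mechanically rather than by hand. A minimal sketch, assuming both files list one filename per row (the real layout of gdrive_filelist.csv may differ):

```python
import csv

def compare_filelists(reference_csv, log_csv):
    """Return (in reference but not log, in log but not reference)."""
    def names(path):
        # Read the first column of each non-empty row as a filename.
        with open(path, newline="") as f:
            return {row[0] for row in csv.reader(f) if row}
    ref, log = names(reference_csv), names(log_csv)
    return ref - log, log - ref
```

With matching formats the comparison reduces to two set differences, so a mismatch of even one of the ~160 filenames is caught immediately.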

Option 3 - Obfuscate the filenames. If publishing the filenames is an issue, we could change them to something random and/or meaningless that could appear in the repo. We'd then have to map these to the actual filenames, e.g. in gdrive_filelist.csv. Not as good as Option 1, but an alternative to Option 2.

jowagner commented 3 years ago

TLDR for @tlynn747: Everything after the initial question is a discussion of what to do if your answer is that we must keep the list secret.

With option 2, even if the log contains a 1:1 copy of gdrive_filelist.csv, you'd still have a hard time figuring out what the specific selection of files means, i.e. what criteria led to the selection. A possibility would be to add a file known_gdrive_filelists.csv to the repo with two columns, "sha256sum of gdrive_filelist" and "description", and to print the sha256sum to the log. Our tools could print a reminder to update this file when an unknown filelist is used.
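This registry idea could be sketched roughly as follows; the file names known_gdrive_filelists.csv and the column names sha256sum/description are the hypothetical ones from the paragraph above:

```python
import csv
import hashlib

def sha256_of_file(path):
    """Hex SHA-256 digest of a file's contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def describe_filelist(filelist_path, registry_path):
    """Look the filelist's digest up in the registry of known lists.

    Returns (digest, description); description is None and a reminder
    is printed when the digest is not registered yet.
    """
    digest = sha256_of_file(filelist_path)
    with open(registry_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["sha256sum"] == digest:
                return digest, row["description"]
    print(f"Unknown filelist {digest}; please add it to {registry_path}")
    return digest, None
```

Printing the digest to every training log then ties each model to one known, described selection of files without revealing any filenames.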

Option 3: Working with randomised filenames would be hard for maintaining the files on gdrive and for checking things. What might work is to use IDs and split the filelist into two parts: (1) a list maintained on gdrive that maps filenames to IDs, never changing existing rows, (2) a list in the source code repo that maps IDs to inclusion status. To make it easy to work with these, they should be aligned line-by-line all the time. Alternatively, one could work on the output of the join command.
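The join between the two parts could also be done in code rather than with the join command. A minimal sketch, with hypothetical column names id, filename and include:

```python
import csv

def selected_files(names_csv, status_csv):
    """Join the private filename->ID list with the public ID->status list.

    names_csv lives on Google Drive (columns: id, filename);
    status_csv lives in the repo (columns: id, include).
    Returns the filenames whose ID is marked for inclusion.
    """
    with open(status_csv, newline="") as f:
        included = {row["id"] for row in csv.DictReader(f)
                    if row["include"] == "yes"}
    with open(names_csv, newline="") as f:
        return [row["filename"] for row in csv.DictReader(f)
                if row["id"] in included]
```

Because existing rows of the ID list are never changed, the public status file stays meaningful across revisions of the private list.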

Both options 1 and 3 require us to write good commit messages and commit frequently so that there is useful information on any change of the filelist in the repo.

jbrry commented 3 years ago

Yes, I agree Options 1 and 3 are better. Assuming Option 1 doesn't go ahead, and using Joachim's description of Option 3 as an example, it could look something like this:

1) Store a mapping from filenames to IDs on Google Drive:

{"filename": "sha256sum"}

2) Store a filelist which maps IDs to inclusion status (somewhere in this repository):

{"sha256sum": True/False}

jowagner commented 3 years ago

> {"filename": "sha256sum"}
> [...]
> {"sha256sum": True/False}

Using a hash function for option 3 is not good, as it allows people to test whether a given candidate path is in the list. An attacker can test many millions of candidates, and guessing becomes a lot easier with each piece of information that looks harmless on its own, e.g. the name of the top-level folder.

In contrast, a hash of the overall .csv file for option 2 is fine as it is not feasible to guess the full list of paths.
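A toy illustration of the difference, with made-up paths: confirming a guessed path against a leaked per-path digest is a single hash comparison, whereas the digest of the whole CSV only matches if the entire file contents, including order and formatting, are guessed exactly.

```python
import hashlib

def sha256_hex(text):
    return hashlib.sha256(text.encode()).hexdigest()

# Pretend this per-path digest appeared in a public repo
# (the path itself is hypothetical).
leaked_digest = sha256_hex("corpus/twitter_ga.txt")

# A dictionary attack: hash candidate paths and compare.
candidates = ["corpus/wikipedia_ga.txt", "corpus/twitter_ga.txt"]
matches = [c for c in candidates if sha256_hex(c) == leaked_digest]

# Hashing the full filelist has no such weakness: an attacker would
# have to guess every line of the file at once to confirm anything.
whole_file_digest = sha256_hex("\n".join(candidates))
```

Here a two-entry candidate list already recovers the "secret" path; real attackers enumerate millions of plausible paths, which is why per-filename hashes leak and a whole-file hash does not.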