Partition training data

ksuderman commented 3 years ago

Set up a system/program to partition the training data (see #1 and #10) into k partitions so we can perform k-fold validation runs.

Inputs

k - the number of partitions to generate
n - the total size of the generated evaluation set
a directory of files containing positive examples
a directory of files containing negative examples
the output directory to write results.

Outputs

k files in the output directory, each with n/k file IDs

Notes

If n > |training set|, print a warning and generate the k partitions using the entire training set. Otherwise, select n documents at random from the entire training set.
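A minimal sketch of how the partitioning could work, not a definitive implementation. It reads the warning case as n being larger than the training set, since that is the only case where n documents cannot be selected. The function names (`collect_ids`, `partition_ids`, `write_partitions`), the use of file names as file IDs, the `seed` parameter, and the `partition-<i>.txt` output naming are all assumptions made for illustration. It also pools positive and negative examples without stratifying by class, which the issue does not specify either way.

```python
import random
import warnings
from pathlib import Path


def collect_ids(*directories):
    """Return sorted file IDs from the given directories.

    Assumption: a file's name serves as its ID.
    """
    ids = []
    for directory in directories:
        ids.extend(p.name for p in Path(directory).iterdir() if p.is_file())
    return sorted(ids)


def partition_ids(ids, k, n, seed=None):
    """Split up to n randomly chosen IDs into k partitions of near-equal size."""
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = random.Random(seed)
    if n > len(ids):
        # Assumed reading of the note: warn and fall back to the whole set.
        warnings.warn(
            f"n={n} exceeds the training set size ({len(ids)}); "
            "using the entire training set"
        )
        n = len(ids)
    selected = rng.sample(ids, n)  # random order, no repeats
    # Deal IDs round-robin so partition sizes differ by at most one.
    return [selected[i::k] for i in range(k)]


def write_partitions(partitions, output_dir):
    """Write one file per partition, one file ID per line."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for index, part in enumerate(partitions):
        # Hypothetical naming scheme: partition-0.txt, partition-1.txt, ...
        (out / f"partition-{index}.txt").write_text("\n".join(part) + "\n")
```

A typical call would be `write_partitions(partition_ids(collect_ids(pos_dir, neg_dir), k, n), out_dir)`. When n is not divisible by k, the partitions hold either floor(n/k) or ceil(n/k) IDs rather than exactly n/k.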

ksuderman commented 3 years ago

Part of #5

nancyide commented 3 years ago

No need to do this if using BERT. Should get rid of this task.

lappsgrid-incubator / galaxy-paper-rank

Partition training data #11