lappsgrid-incubator / galaxy-paper-rank

Project home for the paper identification project for Galaxy.
Apache License 2.0

Partition training data #11

Closed ksuderman closed 3 years ago

ksuderman commented 3 years ago

Set up a program to partition the training data (see #1 and #10) into k partitions so we can perform k-fold cross-validation runs.

Inputs

  1. k - the number of partitions to generate
  2. n - the total size of the generated evaluation set
  3. a directory of files containing positive examples
  4. a directory of files containing negative examples
  5. the output directory to write results to

Outputs

  1. k files in the output directory, each with n/k file IDs

Notes

If n > |training set|, print a warning and generate the k partitions using the entire training set. Otherwise, select n documents at random from the entire training set.
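A minimal sketch of how the partitioning could work, not a definitive implementation. It makes several assumptions that the issue does not settle: the function name `partition_ids`, the `seed` parameter, and the `partition-<i>.txt` output file names are hypothetical; a file's ID is taken to be its name without the extension; IDs are assumed to be unique across the positive and negative directories; and the warning case is read as n exceeding the number of available documents. The sketch does not stratify by class, so the positive/negative ratio can vary between partitions.

```python
import random
import sys
from pathlib import Path


def partition_ids(positive_dir, negative_dir, output_dir, k, n, seed=None):
    """Write k files to output_dir, each listing roughly n/k file IDs.

    Returns the partitions as a list of k lists of IDs.
    """
    rng = random.Random(seed)

    # Assumption: a file's ID is its name without the extension.
    ids = sorted(
        path.stem
        for directory in (positive_dir, negative_dir)
        for path in Path(directory).iterdir()
        if path.is_file()
    )

    if n > len(ids):
        # Not enough documents to draw n: warn and use everything.
        print(
            f"warning: n={n} exceeds the training set size ({len(ids)}); "
            "using the entire training set",
            file=sys.stderr,
        )
        selected = list(ids)
    else:
        # Draw n distinct documents at random.
        selected = rng.sample(ids, n)

    rng.shuffle(selected)

    # Deal IDs round-robin so partition sizes differ by at most one.
    folds = [selected[i::k] for i in range(k)]

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, fold in enumerate(folds):
        # Hypothetical naming scheme: one ID per line in partition-<i>.txt.
        (out / f"partition-{i}.txt").write_text("".join(f"{x}\n" for x in fold))

    return folds
```

When n is not a multiple of k, the round-robin split gives some partitions one more ID than others rather than dropping documents. Passing a fixed `seed` makes a run repeatable, which helps when comparing validation runs.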

ksuderman commented 3 years ago

Part of #5

nancyide commented 3 years ago

No need to do this if using BERT. Should get rid of this task.