Closed shawnmjones closed 3 years ago
This work cannot truly be completed until other work is done, because many of the algorithms run by sample require identify (#44), score (#42), order (#43), cluster (#45), and filter (#47).
At this point, sample supports the following (not completely tested) algorithms out of the box:
# hc sample --help
usage: hc sample [-h] {DSA1,DSA2,DSA3,DSA4,filtered-random,order-by-memento-datetime-then-systematically-sample,simple-search-engine,true-random,systematic,stratified-random,stratified-systematic,random-cluster,random-oversample,random-undersample} ...

'sample' produces a list of exemplars from a collection by applying an existing algorithm

positional arguments:
  {DSA1,DSA2,DSA3,DSA4,filtered-random,order-by-memento-datetime-then-systematically-sample,simple-search-engine,true-random,systematic,stratified-random,stratified-systematic,random-cluster,random-oversample,random-undersample}
                        sampling methods
    DSA1                An implementation of the algorithm from AlNoamany's dissertation.
    DSA2                An implementation of the DSA2 algorithm from Jones' dissertation.
    DSA3                An implementation of the DSA3 algorithm from Jones' dissertation.
    DSA4                An implementation of the DSA4 algorithm from Jones' dissertation.
    filtered-random     Filter the collection for off-topic mementos and exclude near duplicates before randomly sampling from remainder.
    order-by-memento-datetime-then-systematically-sample
                        Select exemplars from a web archive collection by first ordering a colleciton, then systematically sampling every jth memento from the remainder.
    simple-search-engine
                        Search for mementos with a specific pattern, score results by BM25, order by descending score.
    true-random         sample probabilistically by randomly sampling k mementos from the input
    systematic          returns every jth memento from the input
    stratified-random   returns j items randomly chosen from each cluster, requries that the input be clustered with the cluster action
    stratified-systematic
                        returns every jth URI-M from each cluster, requries that the input be clustered with the cluster action
    random-cluster      return j randomly selected clusters from the sample, requires that the input be clustered with the cluster action
    random-oversample   randomly duplicates URI-Ms in the smaller clusters until they match the size of the largest cluster, requires input be clustered with the cluster action
    random-undersample  randomly chooses URI-Ms from the larger clusters until they match the size of the smallest cluster, requires input be clustered with the cluster action

optional arguments:
  -h, --help            show this help message and exit
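
For readers unfamiliar with the simpler sampling methods named in the help text, here is a minimal Python sketch of what their one-line descriptions say. It is not Hypercane's code. The function names, the list-of-URI-Ms input, and the dict mapping a cluster id to its URI-Ms are all assumptions made for illustration, as is the choice to start counting at 1 for "every jth memento".

```python
import random


def true_random(urims, k, seed=None):
    """Return k mementos chosen uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(urims, min(k, len(urims)))


def systematic(urims, j):
    """Return every jth memento (assumes counting starts at 1: jth, 2jth, ...)."""
    return urims[j - 1::j]


def stratified_random(clusters, j, seed=None):
    """Return up to j mementos drawn at random from each cluster.

    `clusters` is a hypothetical dict of cluster id -> list of URI-Ms.
    """
    rng = random.Random(seed)
    sample = []
    for cluster_id in sorted(clusters):
        members = clusters[cluster_id]
        sample.extend(rng.sample(members, min(j, len(members))))
    return sample


def random_undersample(clusters, seed=None):
    """Shrink every cluster to the size of the smallest one by random choice."""
    rng = random.Random(seed)
    smallest = min(len(members) for members in clusters.values())
    return {cid: rng.sample(members, smallest) for cid, members in clusters.items()}


def random_oversample(clusters, seed=None):
    """Grow every cluster to the size of the largest one by duplicating members.

    Assumes no cluster is empty, since there would be nothing to duplicate.
    """
    rng = random.Random(seed)
    largest = max(len(members) for members in clusters.values())
    return {
        cid: members + [rng.choice(members) for _ in range(largest - len(members))]
        for cid, members in clusters.items()
    }
```

With ten mementos named m1 through m10, `systematic(urims, 3)` keeps m3, m6, and m9. The cluster-based functions only make sense on clustered input, which matches the help text's note that those methods require the cluster action to have run first.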
The arguments for these all appear in Wooey, so it looks like sample works properly in the GUI as well.
I developed a method of annotating BASH scripts with some JSON so that Hypercane is aware of the arguments supported by the BASH script. This seems to have worked well. I will not implement any more algorithms until after we have tested more with NLA.
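
The annotation format itself is not shown in this thread, so the following Python sketch only illustrates the general idea of carrying argument metadata as JSON inside a script's comments. Everything specific in it is hypothetical: the `# ARGS-JSON:` marker, the `description`/`arguments`/`flags`/`dest` keys, and the example script. It also assumes all annotated arguments are optional flags.

```python
import argparse
import json

# Hypothetical marker; the real annotation syntax is not given in this issue.
MARKER = "# ARGS-JSON:"

EXAMPLE_SCRIPT = """#!/bin/bash
# ARGS-JSON: {"description": "hypothetical example algorithm",
# ARGS-JSON:  "arguments": [
# ARGS-JSON:    {"flags": ["-k"], "dest": "sample_count",
# ARGS-JSON:     "help": "number of mementos to select", "default": "10"}
# ARGS-JSON:  ]}
echo "sampling ${sample_count} mementos"
"""


def read_annotation(script_text):
    """Join the marker-prefixed comment lines and parse them as one JSON document."""
    payload = [
        line.strip()[len(MARKER):]
        for line in script_text.splitlines()
        if line.strip().startswith(MARKER)
    ]
    if not payload:
        return None  # the script carries no annotation
    return json.loads("\n".join(payload))


def build_parser(name, annotation):
    """Turn the parsed annotation into an argparse parser for the script."""
    parser = argparse.ArgumentParser(
        prog=name, description=annotation.get("description")
    )
    for arg in annotation.get("arguments", []):
        parser.add_argument(
            *arg["flags"],
            dest=arg["dest"],
            help=arg.get("help"),
            default=arg.get("default"),
            required=arg.get("required", False),
        )
    return parser


annotation = read_annotation(EXAMPLE_SCRIPT)
parser = build_parser("example-algorithm", annotation)
args = parser.parse_args(["-k", "5"])
```

Keeping the metadata in comments means the script still runs unchanged under Bash, while the calling tool can discover its arguments without executing it.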
This works now that caching is enabled. Closing.
The existing CLI application must be reworked. This work was started already and needs to be tested.
Once that work is done, we can add the corresponding GUI script for the Wooey interface.