oduwsdl / hypercane

A toolkit for developing algorithms that sample mementos from a web archive collection.
https://oduwsdl.github.io/hypercane
MIT License

Finish Hypercane GUI script for sample action #41

Closed shawnmjones closed 3 years ago

shawnmjones commented 3 years ago

The existing CLI application must be reworked. This work has already been started and still needs to be tested.

Once that work is done, we can add the corresponding GUI script for the Wooey interface.

shawnmjones commented 3 years ago

This work cannot truly be completed until other work is done, because many of the algorithms run by sample depend on the identify (#44), score (#42), order (#43), cluster (#45), and filter (#47) actions.

shawnmjones commented 3 years ago

At this point, sample supports the following (not completely tested) algorithms out of the box:

# hc sample --help
usage: hc sample [-h] {DSA1,DSA2,DSA3,DSA4,filtered-random,order-by-memento-datetime-then-systematically-sample,simple-search-engine,true-random,systematic,stratified-random,stratified-systematic,random-cluster,random-oversample,random-undersample} ...

'sample' produces a list of exemplars from a collection by applying an existing algorithm

positional arguments:
  {DSA1,DSA2,DSA3,DSA4,filtered-random,order-by-memento-datetime-then-systematically-sample,simple-search-engine,true-random,systematic,stratified-random,stratified-systematic,random-cluster,random-oversample,random-undersample}
                        sampling methods
    DSA1                An implementation of the algorithm from AlNoamany's dissertation.
    DSA2                An implementation of the DSA2 algorithm from Jones' dissertation.
    DSA3                An implementation of the DSA3 algorithm from Jones' dissertation.
    DSA4                An implementation of the DSA4 algorithm from Jones' dissertation.
    filtered-random     Filter the collection for off-topic mementos and exclude near duplicates before randomly sampling from remainder.
    order-by-memento-datetime-then-systematically-sample
                        Select exemplars from a web archive collection by first ordering a colleciton, then systematically sampling every jth memento from the remainder.
    simple-search-engine
                        Search for mementos with a specific pattern, score results by BM25, order by descending score.
    true-random         sample probabilistically by randomly sampling k mementos from the input
    systematic          returns every jth memento from the input
    stratified-random   returns j items randomly chosen from each cluster, requries that the input be clustered with the cluster action
    stratified-systematic
                        returns every jth URI-M from each cluster, requries that the input be clustered with the cluster action
    random-cluster      return j randomly selected clusters from the sample, requires that the input be clustered with the cluster action
    random-oversample   randomly duplicates URI-Ms in the smaller clusters until they match the size of the largest cluster, requires input be clustered with the cluster action
    random-undersample  randomly chooses URI-Ms from the larger clusters until they match the size of the smallest cluster, requires input be clustered with the cluster action

optional arguments:
  -h, --help            show this help message and exit
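
For readers unfamiliar with the probabilistic methods listed in the help text, here is a minimal sketch of what they describe. It is not Hypercane's implementation. The function names, the representation of clusters as a dict of lists, the `seed` parameter, and the choice to start systematic sampling at the jth item are all assumptions made for illustration.

```python
import random


def systematic(urims, j):
    """Return every jth item (assumption: the jth, 2jth, ... items)."""
    return urims[j - 1::j]


def true_random(urims, k, seed=None):
    """Randomly sample k items without replacement (all items if k is too large)."""
    rng = random.Random(seed)
    return rng.sample(urims, min(k, len(urims)))


def stratified_random(clusters, j, seed=None):
    """Return up to j randomly chosen items from each cluster."""
    rng = random.Random(seed)
    sample = []
    for members in clusters.values():
        sample.extend(rng.sample(members, min(j, len(members))))
    return sample


def stratified_systematic(clusters, j):
    """Return every jth item from each cluster."""
    sample = []
    for members in clusters.values():
        sample.extend(systematic(members, j))
    return sample


def random_oversample(clusters, seed=None):
    """Duplicate random members of smaller clusters up to the largest cluster's size.

    Assumes at least one cluster and no empty clusters.
    """
    rng = random.Random(seed)
    target = max(len(members) for members in clusters.values())
    balanced = {}
    for cluster_id, members in clusters.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        balanced[cluster_id] = list(members) + extra
    return balanced


def random_undersample(clusters, seed=None):
    """Randomly keep as many members per cluster as the smallest cluster has."""
    rng = random.Random(seed)
    target = min(len(members) for members in clusters.values())
    return {
        cluster_id: rng.sample(members, target)
        for cluster_id, members in clusters.items()
    }
```

The stratified and over/undersampling functions take already-clustered input, which mirrors the help text's requirement that the input first be clustered with the cluster action.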

The arguments for these all appear in Wooey, so it looks like sample works properly in the GUI as well.
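
As background on why the arguments show up there: Wooey builds its web forms from a script's argparse definition. The sketch below mirrors the shape of the `hc sample` usage line with argparse subparsers. It is only an illustration, not Hypercane's actual parser. The `--j` and `--k` flags and the `build_parser` function are hypothetical, and only two of the methods are included.

```python
import argparse


def build_parser():
    """Build a small parser shaped like `hc sample <method> ...` (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="hc sample",
        description="'sample' produces a list of exemplars from a collection "
                    "by applying an existing algorithm",
    )
    subparsers = parser.add_subparsers(
        dest="method", required=True, help="sampling methods"
    )

    # Hypothetical flag name: the real argument for j is not shown in the help output.
    systematic = subparsers.add_parser(
        "systematic", help="returns every jth memento from the input"
    )
    systematic.add_argument("--j", type=int, required=True,
                            help="interval between selected mementos")

    # Hypothetical flag name: the real argument for k is not shown in the help output.
    true_random = subparsers.add_parser(
        "true-random",
        help="sample probabilistically by randomly sampling k mementos from the input",
    )
    true_random.add_argument("--k", type=int, required=True,
                             help="number of mementos to sample")

    return parser
```

Each subcommand carries its own arguments, so a form generator that reads the parser can present a separate set of fields per sampling method.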

I developed a method of annotating Bash scripts with some JSON so that Hypercane knows which arguments each script supports. This seems to have worked well. I will not implement any more algorithms until after we have tested more with NLA.
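
To illustrate the general idea of annotating a script with JSON, here is a rough sketch of how a Python tool could read such an annotation from a comment line. The annotation format used by Hypercane is not described in this thread, so the `# args-json:` marker, the JSON keys, and the `read_annotation` function are all hypothetical.

```python
import json

# Hypothetical marker: the actual annotation syntax is not shown in this issue.
MARKER = "# args-json:"


def read_annotation(script_text):
    """Return the parsed JSON from the first annotated comment line, or None."""
    for line in script_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(MARKER):
            return json.loads(stripped[len(MARKER):])
    return None


# Hypothetical example script with a single annotated argument.
EXAMPLE_SCRIPT = """#!/bin/bash
# args-json: {"description": "example algorithm", "arguments": [{"flag": "--k", "type": "int", "default": 10}]}
echo "sampling..."
"""
```

Because the annotation lives in a comment, Bash ignores it when the script runs, while the calling tool can still discover which arguments to expose.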

shawnmjones commented 3 years ago

This works now that caching is enabled. Closing.