o2r-project / o2r-meta

Metadata toolsuite for an extract-map-validate workflow supporting reproducible research
Apache License 2.0

Ordering of display files and default selection #105

Closed nuest closed 6 years ago

nuest commented 6 years ago

Using our test compendium Aquestiondrivenprocess, the extractor creates the following metadata (excerpt):

[...]
"raw": {
    "mainfile_candidates": [
        "shFun.R",
        "multiplot.R",
        "Ui.R",
        "LaunchModel.R",
        "main.Rmd",
        "server.R"
    ],
    "mainfile": "main.Rmd",
    "displayfile_candidates": [
        "figure89.png",
        "figure2.png",
        "figure3.png",
        "table1.png",
        "display.html",
        "figure4.png",
        "figure6.png",
        "figure5.png",
        "table4.png",
        "table2.png",
        "table5.png",
        "figure1.png",
        "table3.png"
    ],
    "displayfile": "figure89.png",
    [...]

In the display files, it should prefer display.html.

The existing files:

image

@7048730 can you confirm that the function best_candidate currently only works for mainfiles and must be extended for displayfiles?

ghost commented 6 years ago

@nuest just to clarify: do you want an ordered list as the result, or do you want to leave out less suitable candidates? (This concerns the displayfiles list.)

The best_candidate function is not used explicitly for either mainfiles or displayfiles. Instead, it combines all gathered metadata elements into a single list, for example when half of the relevant elements are found in one file and the rest in another during extraction. So it sets up a competition for the most complex or complete information on each extracted element. (This concerns all files encountered during extraction.)
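To make the idea concrete, here is a minimal sketch of such a per-element competition. It is not the actual best_candidate code: the names `completeness` and `merge_best` are hypothetical, and the scoring rule (count the non-empty leaf values) is only an assumption about what "most complete" could mean.

```python
def completeness(value):
    """Assumed score: the number of non-empty leaf values in a metadata element."""
    if value is None:
        return 0
    if isinstance(value, dict):
        return sum(completeness(v) for v in value.values())
    if isinstance(value, (list, tuple, set)):
        return sum(completeness(v) for v in value)
    if isinstance(value, str):
        return 1 if value.strip() else 0
    return 1


def merge_best(extractions):
    """Merge per-file metadata dicts, keeping the most complete value per element."""
    merged = {}
    for metadata in extractions:
        for key, value in metadata.items():
            # A later file only wins if it carries strictly more information.
            if key not in merged or completeness(value) > completeness(merged[key]):
                merged[key] = value
    return merged


# Two files that each hold half of the relevant elements (made-up values).
from_file_a = {"title": "T", "keywords": []}
from_file_b = {"title": "", "keywords": ["a", "b"]}
print(merge_best([from_file_a, from_file_b]))
```

With this rule, ties go to the file that was seen first, so the result depends on the extraction order whenever two files are equally complete.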

It might be a good idea to rewrite best_candidate as a more generic function. The code is not very sophisticated either, since I built it around ever-changing requirements from the frontend guys 😃 ☮️

nuest commented 6 years ago

Ordered list.

Does the ordered list then also fix the default displayfile, i.e. what is first in the list is then used?

ghost commented 6 years ago

It depends on the criterion by which the list is sorted. That would require quantifying the "goodness" of each candidate, i.e. how close each displayfile seen during extraction is to the best possible candidate. This is not yet implemented. If sorting by file extension is enough, that could be quite an easy task. Cf. also: https://github.com/o2r-project/o2r-meta/blob/3f9fb513b2a0b539c7db3ed53d17f2b2c92b90be/extract/metaextract.py#L395

nuest commented 6 years ago

Most importantly I need the recommended filename, i.e. display.html, to be the first in the list.

After that, files named "display" with other extensions, e.g. display.png or display.pdf, or other .html files... but I have no clear idea for that.

Maybe a string distance calculation of all file names with display.html would work well? The one with smallest distance wins.
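A minimal sketch of that idea, assuming plain Levenshtein distance as the string distance (the thread does not name a specific metric). The function names `levenshtein` and `rank_by_distance` are illustrative and not part of o2r-meta.

```python
def levenshtein(a, b):
    """Edit distance counting single-character inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def rank_by_distance(candidates, target="display.html"):
    """Sort candidates by distance to the target; ties are broken alphabetically."""
    return sorted(candidates, key=lambda name: (levenshtein(name.lower(), target), name))


print(rank_by_distance(["figure89.png", "table1.png", "display.html", "display.png"]))
```

The first entry of the returned list could then serve as the default displayfile.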

ghost commented 6 years ago

Yes, but edit distance might result in e.g. play.html getting a higher rank than display.jpeg. I think we should go the easy way: keep splitting the name and the extension, rank everything we can anticipate, and do magic with the rest.
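The split-and-rank approach could be sketched as follows. The tier order is an assumption pieced together from this thread (display.html first, then other files named "display", then other .html files, then everything else), and `display_rank` and `order_displayfile_candidates` are hypothetical names, not functions in metaextract.py.

```python
import os

PREFERRED_STEM = "display"
PREFERRED_EXTENSION = ".html"


def display_rank(filename):
    """Return a tier number for a candidate; lower tiers sort first."""
    stem, extension = os.path.splitext(os.path.basename(filename).lower())
    if stem == PREFERRED_STEM and extension == PREFERRED_EXTENSION:
        return 0  # the recommended file name
    if stem == PREFERRED_STEM:
        return 1  # e.g. display.png, display.pdf
    if extension == PREFERRED_EXTENSION:
        return 2  # other HTML files
    return 3      # everything we did not anticipate


def order_displayfile_candidates(candidates):
    """Order candidates by tier; sorted() is stable, so ties keep extraction order."""
    return sorted(candidates, key=display_rank)


candidates = ["figure89.png", "figure2.png", "display.html", "table1.png"]
ordered = order_displayfile_candidates(candidates)
displayfile = ordered[0]  # the first entry doubles as the default displayfile
print(ordered, displayfile)
```

Unlike the distance-based ranking, this puts display.jpeg ahead of play.html, and unanticipated files simply stay in the order in which they were found.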