Closed. nuest closed this 6 years ago.
@nuest just to clarify: do you want to have an ordered list as a result or do you want to keep out less suitable candidates? (this concerns the displayfiles list)
The best_candidate function is not used explicitly for either mainfiles or displayfiles. Instead, it completes all gathered metadata elements in a single list, for example when half of the relevant elements are found in one file and the rest in another during extraction. It therefore sets up a competition for the most complex or complete information on each extracted element (this concerns all files encountered during extraction).
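To make the "most complete value wins" idea concrete, here is a minimal sketch. It is not the actual best_candidate code from o2r-meta: the names completeness and merge_best, the length-based score, and the sample dictionaries are all assumptions for illustration only.

```python
def completeness(value):
    """Crude, assumed score: empty values score 0, larger values score higher."""
    if value is None:
        return 0
    if isinstance(value, (list, dict, str)):
        return len(value)
    return 1


def merge_best(candidates):
    """Merge per-file metadata dicts, keeping the most complete value per key."""
    merged = {}
    for metadata in candidates:
        for key, value in metadata.items():
            # A new value only replaces the current one if it scores strictly higher.
            if completeness(value) > completeness(merged.get(key)):
                merged[key] = value
    return merged


# Hypothetical extraction results from two files of the same compendium.
from_rmd = {"title": "Some title", "author": [], "keywords": ["R"]}
from_html = {"title": "", "author": ["Jane Doe"], "keywords": []}

result = merge_best([from_rmd, from_html])
```

In this sketch each element is decided independently, so the title and keywords come from the first file and the author from the second. A real scoring function would probably need to be smarter than plain length.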
It might be a good idea to rewrite best_candidate in a more generic way. The code is not very sophisticated either, since I built it around ever-changing requirements from the frontend guys 😃 ☮️
Ordered list.
Does the ordered list then also fix the default displayfile, i.e. is whatever comes first in the list then used?
It depends on the criterion by which the list is sorted. That would require quantifying the "goodness" of each candidate, i.e. how close each displayfile seen during extraction is to the best possible candidate. This is not yet implemented. If sorting by file extension is enough, that could be quite an easy task. Cf. also: https://github.com/o2r-project/o2r-meta/blob/3f9fb513b2a0b539c7db3ed53d17f2b2c92b90be/extract/metaextract.py#L395
Most importantly I need the recommended filename, i.e. display.html, to be the first in the list.
Then other file extensions with the name "display", e.g. display.png or display.pdf, or other filenames for .html ... but no clear idea for that.
Maybe a string distance calculation of all file names against display.html would work well? The one with the smallest distance wins.
Yes, this edit distance thing. It might result in e.g. play.html getting a higher rank than display.jpeg.
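A quick sketch to check that concern, using a plain Levenshtein distance written out by hand (no particular library is assumed; the function name levenshtein and the sample file names are only for illustration):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


TARGET = "display.html"
names = ["display.jpeg", "play.html", "display.html"]

# Smallest distance to the recommended filename comes first.
ranked = sorted(names, key=lambda name: levenshtein(name, TARGET))
```

Here play.html is 3 edits away from display.html (drop "dis"), while display.jpeg is 4 edits away (all four extension characters differ), so play.html would indeed be ranked above display.jpeg.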
I think we should go the easy way: keep splitting the name and extension, then rank all the things we can anticipate. Do magic with the rest.
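A rough sketch of what that split-and-rank approach could look like. The names PREFERRED_NAME, EXTENSION_RANK and display_rank are hypothetical, and the order of the extensions after .html is an assumption, since the thread does not fix it:

```python
import os

PREFERRED_NAME = "display"
# Assumed extension priority; anything not listed is ranked last.
EXTENSION_RANK = {".html": 0, ".pdf": 1, ".png": 2}


def display_rank(path):
    """Sort key: files named "display" first, then by extension priority."""
    stem, ext = os.path.splitext(os.path.basename(path))
    name_rank = 0 if stem.lower() == PREFERRED_NAME else 1
    ext_rank = EXTENSION_RANK.get(ext.lower(), len(EXTENSION_RANK))
    # The path itself is the final tie-breaker so the order is deterministic.
    return (name_rank, ext_rank, path)


# Hypothetical files found during extraction.
files = ["paper.html", "display.pdf", "figure.jpeg",
         "display.html", "display.png", "notes.txt"]

displayfiles = sorted(files, key=display_rank)
```

With this key, display.html always ends up first, followed by the other "display" files, then other .html files, then everything else in alphabetical order. The "magic with the rest" is reduced to an alphabetical fallback here and could be replaced by something smarter.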
Using our test compendium Aquestiondrivenprocess, the extractor creates the following metadata (excerpt):
In the display files, it should prefer
The existing files:
@7048730 can you confirm that the function best_candidate currently only works for mainfiles, and that the function must be extended for displayfiles?