metafacture / metafacture-core

Core package of the Metafacture tool suite for metadata processing.
https://metafacture.org
Apache License 2.0

sort-triples: Add option to sort triples as numbers not as strings #380

Closed TobiasNx closed 3 years ago

TobiasNx commented 3 years ago

If you run count-triples and then sort-triples(by="OBJECT"), the values are sorted as strings. After counting, OBJECT holds the count as a string.

For example, the following list (with template("${o}\t${s}")):

...
24      metadata.mods.relatedItem.typeOfResource.usage
248     metadata.mods.name.role.roleTerm.value
27      metadata.mods.subject.type
29      metadata.mods.abstract.value
29      metadata.mods.name.affiliation.value
3       metadata.mods.originInfo.dateModified.encoding
3       metadata.mods.originInfo.dateModified.value
30      metadata.mods.relatedItem.abstract.altRepGroup
320     metadata.mods.name.role.roleTerm.authority
320     metadata.mods.name.role.roleTerm.type
322     metadata.mods.name.type
...

It would be great if triples could also be sorted as numbers, not just as strings.
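
For illustration, the difference can be reproduced outside Metafacture. The following is a minimal Java sketch, not Metafacture code: the class and method names are hypothetical, and it only assumes that the counts arrive as strings, as described above. The sample values are the distinct counts from the list above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class; not part of Metafacture.
class CountSortDemo {

    // Sorts the count strings lexicographically, i.e. character by character.
    static List<String> sortAsStrings(List<String> counts) {
        List<String> sorted = new ArrayList<>(counts);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    // Sorts the same strings by their integer value.
    static List<String> sortAsNumbers(List<String> counts) {
        List<String> sorted = new ArrayList<>(counts);
        sorted.sort(Comparator.comparingInt((String s) -> Integer.parseInt(s)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> counts = List.of("24", "248", "27", "29", "3", "30", "320");
        System.out.println(sortAsStrings(counts)); // [24, 248, 27, 29, 3, 30, 320]
        System.out.println(sortAsNumbers(counts)); // [3, 24, 27, 29, 30, 248, 320]
    }
}
```

The string order matches the output shown above, which is why "248" ends up between "24" and "27".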

dr0i commented 3 years ago

Have you tried:

sort-triples(by="OBJECT", order="DECREASING")

TobiasNx commented 3 years ago

Just tried it; unfortunately it does not work, same problem. It sorts the values as literals, not as integers.

24      metadata.mods:mods.mods:relatedItem.mods:typeOfResource.manuscript
24      metadata.mods:mods.mods:relatedItem.mods:typeOfResource.usage
22      metadata.mods:mods.mods:abstract.altFormat
22      metadata.mods:mods.mods:abstract.altRepGroup
22      metadata.mods:mods.mods:abstract.contentType
1629    _id
1629    header.datestamp.value
1629    header.identifier.value
1575    header.status
15      metadata.mods:mods.mods:relatedItem.mods:abstract.altFormat
15      metadata.mods:mods.mods:relatedItem.mods:abstract.altRepGroup
15      metadata.mods:mods.mods:relatedItem.mods:abstract.contentType
14      metadata.mods:mods.mods:subject.mods:topic.authority
14      metadata.mods:mods.mods:subject.mods:topic.authorityURI
14      metadata.mods:mods.mods:subject.mods:topic.valueURI
1       metadata.mods:mods.mods:classification.usage
1       metadata.mods:mods.mods:genre.usage
1       metadata.mods:mods.mods:language.usage
1       metadata.mods:mods.mods:name.mods:nameIdentifier.invalid
1       metadata.mods:mods.mods:name.usage

dr0i commented 3 years ago

This should already be implemented, see https://github.com/metafacture/metafacture-core/issues/43. But it does not seem to work. A test is also missing.

TobiasNx commented 3 years ago

But sorting in decreasing order does work; only the kind of value it sorts by is wrong. The values are sorted decreasingly as alphanumeric strings, not as integers.

Decreasing as alphanumeric:

Is:

366
34
3
26
2444
2222555
113
19
1

Decreasing as integers:

Should:

22225555
2444
366
113
34
26
19
3
1

blackwinter commented 3 years ago

Should be addressed by #409 (sort-triples(by="OBJECT",numeric=true)). Can you confirm?

TobiasNx commented 3 years ago

Unfortunately it does not. Also, I have not seen an option like numeric=true in Metafacture Flux whose value is written without quotation marks, even for booleans.

Tried it with: https://raw.githubusercontent.com/TobiasNx/notWorkingFlux/main/sortTripplesNumeric/json-api-structure.flux

Error-Response:

Exception in thread "main" org.metafacture.flux.FluxParseException: Variable true not assigned.
        at org.metafacture.flux.parser.FlowBuilder.exp(FlowBuilder.java:604)
        at org.metafacture.flux.parser.FlowBuilder.arg(FlowBuilder.java:775)
        at org.metafacture.flux.parser.FlowBuilder.pipe(FlowBuilder.java:718)
        at org.metafacture.flux.parser.FlowBuilder.flowtail(FlowBuilder.java:514)
        at org.metafacture.flux.parser.FlowBuilder.flow(FlowBuilder.java:226)
        at org.metafacture.flux.parser.FlowBuilder.flux(FlowBuilder.java:122)
        at org.metafacture.flux.FluxCompiler.compileFlow(FluxCompiler.java:56)
        at org.metafacture.flux.FluxCompiler.compile(FluxCompiler.java:44)
        at org.metafacture.runner.Flux.main(Flux.java:78)

When using | sort-triples(By="SUBJECT",numeric="TRUE"):

Exception in thread "main" java.lang.NumberFormatException: For input string: "_index"
        at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.base/java.lang.Integer.parseInt(Integer.java:652)
        at java.base/java.lang.Integer.valueOf(Integer.java:983)
        at java.base/java.util.function.Function.lambda$andThen$1(Function.java:88)
        at org.metafacture.triples.AbstractTripleSort.lambda$createComparator$2(AbstractTripleSort.java:216)
        at java.base/java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
        at java.base/java.util.TimSort.sort(TimSort.java:234)
        at java.base/java.util.Arrays.sort(Arrays.java:1515)
        at java.base/java.util.ArrayList.sort(ArrayList.java:1750)
        at java.base/java.util.Collections.sort(Collections.java:179)
        at org.metafacture.triples.AbstractTripleSort.onCloseStream(AbstractTripleSort.java:137)
        at org.metafacture.framework.helpers.DefaultSender.closeStream(DefaultSender.java:68)
        at org.metafacture.framework.helpers.DefaultSender.closeStream(DefaultSender.java:70)
        at org.metafacture.framework.helpers.DefaultSender.closeStream(DefaultSender.java:70)
        at org.metafacture.metamorph.Metamorph.closeStream(Metamorph.java:321)
        at org.metafacture.framework.helpers.DefaultSender.closeStream(DefaultSender.java:70)
        at org.metafacture.framework.helpers.DefaultSender.closeStream(DefaultSender.java:70)
        at org.metafacture.framework.helpers.DefaultSender.closeStream(DefaultSender.java:70)
        at org.metafacture.framework.helpers.DefaultSender.closeStream(DefaultSender.java:70)
        at org.metafacture.framework.helpers.DefaultSender.closeStream(DefaultSender.java:70)
        at org.metafacture.flux.parser.Flow.close(Flow.java:122)
        at org.metafacture.flux.parser.FluxProgramm.start(FluxProgramm.java:164)
        at org.metafacture.runner.Flux.main(Flux.java:78)
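
The second stack trace shows Integer.parseInt being applied to a value that is not a number. A minimal Java sketch of that failure mode follows. It is not the actual AbstractTripleSort code: the simplified Triple class, the helper names, and the way the field is selected are hypothetical stand-ins, and the sample rows are taken from the output earlier in this thread.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

// Hypothetical demo; a simplified stand-in, not the Metafacture implementation.
class NumericTripleSortDemo {

    // Simplified triple holding only the two fields discussed in this thread.
    static final class Triple {
        final String subject;
        final String object;

        Triple(String subject, String object) {
            this.subject = subject;
            this.object = object;
        }
    }

    // Builds a comparator that parses the selected field as an integer.
    static Comparator<Triple> numericComparator(Function<Triple, String> field) {
        return Comparator.comparingInt(t -> Integer.parseInt(field.apply(t)));
    }

    // Returns true if sorting by the selected field throws NumberFormatException.
    static boolean failsToSort(List<Triple> triples, Function<Triple, String> field) {
        try {
            new ArrayList<>(triples).sort(numericComparator(field));
            return false;
        } catch (NumberFormatException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<Triple> triples = List.of(
                new Triple("_id", "1629"),
                new Triple("header.status", "1575"));
        // The subject holds field names, so parsing it as an integer fails.
        System.out.println(failsToSort(triples, t -> t.subject)); // true
        // The object holds the counts, so parsing it succeeds.
        System.out.println(failsToSort(triples, t -> t.object)); // false
    }
}
```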

blackwinter commented 3 years ago

Also, I have not seen an option like numeric=true in Metafacture Flux whose value is written without quotation marks, even for booleans.

Um, sorry, I'm not that well versed in Flux ;)

Exception in thread "main" java.lang.NumberFormatException: For input string: "_index"

What does your input look like? Are you sure you're sorting on the right field?

blackwinter commented 3 years ago

The counts from count-triples are in the OBJECT, aren't they?

TobiasNx commented 3 years ago

+1 You are right: | sort-triples(By="object",numeric="TRUE") works.

I didn't account for the reordering done by the template command. The error came from the impossible task of parsing letters as numbers: SUBJECT holds the field names, while the counts are in OBJECT.

Also | sort-triples(By="object",numeric="TRUE",order="DECREASING") works. That is great.
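
As a rough model of what the working combination does, the Java sketch below sorts (subject, object) pairs by the integer value of the object in decreasing order and prints them in the `${o}\t${s}` layout used earlier. It is an illustration under assumptions, not Metafacture's implementation: the class, the method, and the use of a two-element string array for a triple are all hypothetical. The sample rows are taken from the output earlier in this thread.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo; models the effect of sorting by object, numerically, in decreasing order.
class DecreasingCountDemo {

    // Each row is {subject, object}; returns lines formatted as "object<TAB>subject".
    static List<String> sortAndFormat(List<String[]> rows) {
        List<String[]> sorted = new ArrayList<>(rows);
        // Compare by the integer value of the object, largest first.
        Comparator<String[]> byCount = Comparator.comparingInt(row -> Integer.parseInt(row[1]));
        sorted.sort(byCount.reversed());
        List<String> lines = new ArrayList<>();
        for (String[] row : sorted) {
            lines.add(row[1] + "\t" + row[0]);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
                new String[] {"header.status", "1575"},
                new String[] {"metadata.mods:mods.mods:genre.usage", "1"},
                new String[] {"_id", "1629"},
                new String[] {"metadata.mods:mods.mods:abstract.altFormat", "22"});
        // Prints the rows with counts 1629, 1575, 22, 1, in that order.
        sortAndFormat(rows).forEach(System.out::println);
    }
}
```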

https://github.com/TobiasNx/notWorkingFlux/blob/52440eec4ecde1a3108f785bf6e7a2ec75b6eab6/sortTripplesNumeric/json-api-structure.flux

blackwinter commented 3 years ago

Great, thanks.