Closed BobMuenchen closed 7 years ago
Here's another impact of this sort problem.
> kwic(charisma_c, "vocal", 2)
[BEHAV - #1.TXT, 1] | Vocal | , eye
[BEHAV - #10.TXT, 5] contact, | vocal | inflection,
[BEHAV - #113.TXT, 1] | Vocal | variety,
[BEHAV - #116.TXT, 5] confident, | vocal | , cocky
[BEHAV - #143.TXT, 5] enthusiastic, | vocal | variety,
[BEHAV - #147.TXT, 6] with good | vocal | variety and
[BEHAV - #173.TXT, 6] contact, | vocal | variety
[BEHAV - #174.TXT, 7] gestures, | vocal | range,
[BEHAV - #181.TXT, 5] contact, | vocal | variety,
[BEHAV - #197.TXT, 10] as good | vocal | variety and
[BEHAV - #208.TXT, 22] well as | vocal | variety and
[BEHAV - #22.TXT, 19] expressions, | vocal | variety,
[BEHAV - #225.TXT, 12] have a | vocal | variety,
[BEHAV - #255.TXT, 3] Lots of | vocal | , well
[BEHAV - #277.TXT, 2] Different | vocal | tones,
[BEHAV - #300.TXT, 1] | Vocal | , articulate
[BEHAV - #306.TXT, 4] availability, | vocal | variety
[BEHAV - #307.TXT, 1] | Vocal | - Conversational
[BEHAV - #320.TXT, 4] contact, | vocal | variety,
[BEHAV - #386.TXT, 8] animated, | vocal | variety,
[BEHAV - #406.TXT, 3] Facial and | vocal | expression
[BEHAV - #6.TXT, 1] | Vocal | , good
[BEHAV - #70.TXT, 32] , good | vocal | variety,
[BEHAV - #76.TXT, 7] persuasive, | vocal | variety,
[BEHAV - #8.TXT, 1] | Vocal | , voice
[BEHAV - #96.TXT, 6] gestures, | vocal | variety-
So let's take a more detailed look at the last one above, document 96...
> charisma_c[96]
BEHAV - #185.TXT
Oops, the 96th doc is really #185!
Interesting problem, thanks @BobMuenchen. readtext itself doesn't sort the files at all; it relies on the operating system's wildcard expansion to return the list of files. I think we want to keep relying on whatever the operating system does rather than implementing sorting ourselves, because alphanumeric sorting is not straightforward when you're dealing with multiple and mixed languages and scripts.
With that said, I have two diagnostic questions:
1) Do the files already sort in the way you'd want in Windows Explorer?
2) Did you find a solution that works for you? It doesn't seem to me that we have an easy way of manually changing the sort order at the moment, so we might at least enable you to do that.
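Until there is a built-in way to change the order, one possible workaround is to reorder the documents after reading them. The sketch below uses a plain data frame as a stand-in for a readtext result; the `doc_id` and `text` column names and the `extract_id` helper are assumptions for illustration, not part of readtext's documented behaviour in this thread.

```r
# Stand-in for a readtext result: a data frame with doc_id and text columns.
docs <- data.frame(
  doc_id = c("BEHAV - #1.TXT", "BEHAV - #10.TXT", "BEHAV - #2.TXT"),
  text   = c("first", "tenth", "second"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pull the number after "#" out of each document name.
extract_id <- function(doc_id) {
  as.integer(sub("^.*#([0-9]+).*$", "\\1", doc_id))
}

# Reorder rows by the numeric ID rather than by the character file name.
docs_sorted <- docs[order(extract_id(docs$doc_id)), ]
rownames(docs_sorted) <- NULL
print(docs_sorted$doc_id)
```

This leaves the files on disk untouched, so it only helps if everything downstream uses the reordered object.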
File Explorer sorts them as I was expecting, but since you said you were using the OS's wildcards, I popped open a cmd window and used DIR for the first time in quite a while and, sure enough, it sorted 1, 10, 100. I'm using this data for a workshop, so to avoid confusion between the file name and the internal ID, I simply prefixed the numbers in the file names with enough zeros. Now they sort 001, 002, 003... I expect that having an ID number both in a file name and stored internally is rare enough that it's not worth offering alternate input options for. Thanks!
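For anyone who wants to script the zero-padding workaround rather than rename by hand, here is a minimal sketch in base R. The helper name `pad_file_ids` and the default width of 3 are assumptions for illustration, and it assumes each file name contains a single "#" followed by digits, as the BEHAV files do. It only computes the new names; the actual rename is left commented out.

```r
# Hypothetical helper: zero-pad the digits that follow "#" in each file name.
pad_file_ids <- function(filenames, width = 3) {
  prefix <- sub("#[0-9]+.*$", "#", filenames)   # everything up to and including "#"
  suffix <- sub("^.*#[0-9]+", "", filenames)    # everything after the digits
  ids    <- as.integer(sub("^.*#([0-9]+).*$", "\\1", filenames))
  paste0(prefix, formatC(ids, width = width, flag = "0"), suffix)
}

old_names <- c("BEHAV - #1.TXT", "BEHAV - #10.TXT", "BEHAV - #2.TXT")
new_names <- pad_file_ids(old_names)
print(new_names)

# To apply the rename on disk (run from the folder holding the files):
# file.rename(old_names, new_names)
```

Once padded, a plain character sort and the numeric order agree, which is why the renamed files line up with their internal IDs.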
I think renaming the files is the best option.
I just used this:
to read a set of files whose names contained numbers. I printed the first two docs and saw that the first was OK but the second was blank. Checking the raw file, I saw that it was not blank. I assumed there was a bug, but I finally realized that the document in row #2 actually came from the 10th file. The default sort order in the data frame was:
BEHAV - #1.TXT
BEHAV - #10.TXT
Rather than what I was expecting:
BEHAV - #1.TXT
BEHAV - #2.TXT
This may be due to R's default sort order, but that may cause more confusion than it's worth.
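Whether the ordering comes from R or from the operating system's file listing, the effect matches a plain character sort. A small sketch with made-up file names in the same pattern reproduces it:

```r
files <- c("BEHAV - #2.TXT", "BEHAV - #10.TXT", "BEHAV - #1.TXT")

# A character sort compares names one character at a time, so "#10" lands
# before "#2". method = "radix" sorts in the C locale, which keeps the
# result reproducible across machines.
char_sorted <- sort(files, method = "radix")
print(char_sorted)
```

Here "#1.TXT" comes first, then "#10.TXT", then "#2.TXT", which is the order seen in the data frame.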