quanteda / readtext

an R package for reading text files
https://readtext.quanteda.io

Reconsider default sort order for readtext #58

Closed BobMuenchen closed 7 years ago

BobMuenchen commented 7 years ago

I just used this:

library(readtext)
DATA_DIR <- "C:/Users/muenchen/Documents/R4TEXT/charisma/Behavior"
charisma_df <- readtext(DATA_DIR)

to read a set of files that contained numbers. I printed the top two docs and saw that the first was OK but the second was blank. Checking the raw file, I saw that it was not blank. I assumed there was a bug but I finally realized that the document in row #2 actually came from the 10th file. The default sort order in the data frame was:

BEHAV - #1.TXT
BEHAV - #10.TXT

Rather than what I was expecting:

BEHAV - #1.TXT
BEHAV - #2.TXT

This may be due to R's default sort order, but that may cause more confusion than it's worth.
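For reference, the ordering reported here is what a plain character sort produces. The sketch below uses three made-up file names in the same pattern and base R only; `method = "radix"` is chosen so that the comparison happens in the C locale and does not vary by platform. It illustrates character sorting in general and is not a claim about what readtext does internally.

```r
# Three hypothetical file names following the pattern in the report.
files <- c("BEHAV - #2.TXT", "BEHAV - #10.TXT", "BEHAV - #1.TXT")

# A character sort compares names one character at a time, so "#10"
# lands before "#2" because "1" sorts before "2".
lexical <- sort(files, method = "radix")
print(lexical)
```

Under that comparison the names come out as #1, #10, #2 rather than #1, #2, #10.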

BobMuenchen commented 7 years ago

Here's another impact of this sort problem.

> kwic(charisma_c, "vocal", 2)

[BEHAV - #1.TXT, 1]                  | Vocal | , eye           
[BEHAV - #10.TXT, 5]        contact, | vocal | inflection,     
[BEHAV - #113.TXT, 1]                | Vocal | variety,        
[BEHAV - #116.TXT, 5]     confident, | vocal | , cocky         
[BEHAV - #143.TXT, 5]  enthusiastic, | vocal | variety,        
[BEHAV - #147.TXT, 6]      with good | vocal | variety and     
[BEHAV - #173.TXT, 6]       contact, | vocal | variety         
[BEHAV - #174.TXT, 7]      gestures, | vocal | range,          
[BEHAV - #181.TXT, 5]       contact, | vocal | variety,        
[BEHAV - #197.TXT, 10]       as good | vocal | variety and     
[BEHAV - #208.TXT, 22]       well as | vocal | variety and     
[BEHAV - #22.TXT, 19]   expressions, | vocal | variety,        
[BEHAV - #225.TXT, 12]        have a | vocal | variety,        
[BEHAV - #255.TXT, 3]        Lots of | vocal | , well          
[BEHAV - #277.TXT, 2]      Different | vocal | tones,          
[BEHAV - #300.TXT, 1]                | Vocal | , articulate    
[BEHAV - #306.TXT, 4]  availability, | vocal | variety         
[BEHAV - #307.TXT, 1]                | Vocal | - Conversational
[BEHAV - #320.TXT, 4]       contact, | vocal | variety,        
[BEHAV - #386.TXT, 8]      animated, | vocal | variety,        
[BEHAV - #406.TXT, 3]     Facial and | vocal | expression      
[BEHAV - #6.TXT, 1]                  | Vocal | , good          
[BEHAV - #70.TXT, 32]         , good | vocal | variety,        
[BEHAV - #76.TXT, 7]     persuasive, | vocal | variety,        
[BEHAV - #8.TXT, 1]                  | Vocal | , voice         
[BEHAV - #96.TXT, 6]       gestures, | vocal | variety-    

So let's take a more detailed look at the last one above, document 96...

> charisma_c[96]
BEHAV - #185.TXT 

Oops, the 96th doc is really #185!
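The mismatch follows from positional indexing over a lexically ordered listing. Below is a minimal sketch in base R, assuming (this is not stated in the thread) that the folder holds files #1 through #406 with no gaps. It emulates the listing with a C-locale sort and does not call readtext or quanteda.

```r
# Hypothetical file set: BEHAV - #1.TXT through BEHAV - #406.TXT, no gaps.
files <- sprintf("BEHAV - #%d.TXT", 1:406)

# Emulate a lexical directory listing (C-locale sort, platform independent).
listed <- sort(files, method = "radix")

# Positional lookup returns whichever file happens to be 96th in the listing.
print(listed[96])

# Looking a file up by name shows where file #96 actually sits.
print(match("BEHAV - #96.TXT", listed))
```

Under that assumption the 96th entry is BEHAV - #185.TXT, which agrees with the output above. The number in a file name only matches its position when the names sort numerically.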

adamobeng commented 7 years ago

Interesting problem, thanks @BobMuenchen. readtext itself doesn't sort the files at all; it relies on the operating system's wildcard expansion to return lists of files. I think we want to continue to rely on whatever the operating system is doing rather than implementing it ourselves, because alphanumerical sorting is not straightforward when you're dealing with multiple and mixed languages and scripts.

With that said, I have two diagnostic questions:

1) Do the files already sort in the way you'd want in Windows Explorer?
2) Did you find a solution that works for you? It doesn't seem to me that we have an easy way of manually changing the sort order at the moment, so we might at least enable you to do that.
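One possible way to change the order by hand after reading is to reorder the rows of the resulting data frame. The sketch below is a workaround under stated assumptions, not a readtext feature. It builds a small stand-in data frame instead of calling readtext, and it assumes the result has a `doc_id` column holding the file name and a `text` column. The regular expression is written for this particular naming pattern.

```r
# Stand-in for the data frame produced by reading the directory; the
# doc_id and text columns and their contents are assumptions for this example.
charisma_df <- data.frame(
  doc_id = c("BEHAV - #1.TXT", "BEHAV - #10.TXT", "BEHAV - #2.TXT"),
  text   = c("first", "tenth", "second"),
  stringsAsFactors = FALSE
)

# Pull the number out of each file name and sort the rows by it.
file_no <- as.integer(sub("^.*#([0-9]+)\\.TXT$", "\\1", charisma_df$doc_id))
charisma_sorted <- charisma_df[order(file_no), ]
rownames(charisma_sorted) <- NULL

print(charisma_sorted$doc_id)
```

Sorting on the extracted integer avoids locale-dependent string comparison, but it only works while every file name carries exactly one such number.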

BobMuenchen commented 7 years ago

File Explorer sorts them as I was expecting, but since you said you were using the OS' wildcards I popped open a cmd window and used DIR for the first time in quite a while & sure enough, it sorted 1, 10, 100. I'm using this data for a workshop so to avoid confusion between filename and internal ID, I simply prefixed the file names with enough zeros. Now they sort 001, 002, 003... I expect having an ID number in a filename and stored internally is rare enough to not bother offering alternate input options for. Thanks!
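The zero-padding approach can also be scripted. Here is a minimal sketch in base R that works in a temporary directory with three made-up files, so no real data is touched. The three-digit width and the name pattern are assumptions taken from the examples in this thread.

```r
# Work in a throwaway directory so no real files are touched.
demo_dir <- tempfile("behavior_demo")
dir.create(demo_dir)

old_names <- c("BEHAV - #1.TXT", "BEHAV - #2.TXT", "BEHAV - #10.TXT")
invisible(file.create(file.path(demo_dir, old_names)))

# Rebuild each name with the number padded to three digits.
file_no   <- as.integer(sub("^.*#([0-9]+)\\.TXT$", "\\1", old_names))
new_names <- sprintf("BEHAV - #%03d.TXT", file_no)

renamed <- file.rename(file.path(demo_dir, old_names),
                       file.path(demo_dir, new_names))
stopifnot(all(renamed))

# With equal-width numbers, lexical and numeric order agree.
print(list.files(demo_dir))
```

On real data it would be safer to run this against a copy of the folder first, since `file.rename` changes the files in place.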



kbenoit commented 7 years ago

I think renaming the files is the best option.