trinker / textreadr

Tools to uniformly read in text data including semi-structured transcripts

read_docx inserts numerical strings at start of line #19

Closed EFletcher2014 closed 4 years ago

EFletcher2014 commented 4 years ago

I am using read_docx(file) to read a Word Document into text. However, read_docx is adding strings of numbers to the start of some lines of my document. For example, this line:

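[image: Screenshot (51)] https://user-images.githubusercontent.com/14348637/79893966-c930a280-83d2-11ea-9b98-638c6e9412b8.png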

is being read as:

331724018605500560324018605500Excavation Unit Level SheetLevel Level 1 (0-16cmbd)Date7/12/2017

trinker commented 4 years ago

Are the strings always the same? Can you provide an example of the doc so that this can be recreated?

EFletcher2014 commented 4 years ago

Here is an example document (anonymized). I think the issue may be related to the fact that the document is a form. My corpus consists of multiple copies of these documents; each copy has the same string prepended to the same line, but different lines within a document have different numerical strings prepended to them.

TestReport.docx

trinker commented 4 years ago

Gotcha. It's because the CRAN version grabs the text of each paragraph tag (w:p) and concatenates it. The dev version lets you swap the w:p tag for the w:r or w:t tag, which gives you better control.

Here is the output using these two tags:

textreadr::read_docx("TestReport - Copy.docx", nodetag = '//w:r')
  [1] "354330021272400"                                                   
  [2] "560070021272400"                                                   
  [3] "University"                                                        
  [4] "Unit No."                                                          
  [5] "Z87 L0"                                                            
  [6] "Area"                                                              
  [7] "1x2 M"                                                             
  [8] "331724018605500"                                                   
  [9] "560324018605500"                                                   
 [10] "Excavation Unit Level Sheet"                                       
 [11] "Level"                                                             
 [12] "Level"                                                             
 [13] "1 ("                                                               
 [14] "20"                                                                
 [15] "-"                                                                 
 [16] "2"                                                                 
 [17] "6cmbd)"                                                            
 [18] "Date"                                                              
 [19] "7/12/201"                                                          
 [20] "9"                                                                 
 [21] "355155517017900"                                                   
 [22] "Recorder"                                                          
 [23] "John Doe"                                                          
 [24] "363664515938500"                                                   
 [25] "45720017017900"                                                    
 [26] "Project"                                                           
 [27] "An Archaeology Project"                                            
 [28] "Excavators"                                                        
 [29] "John Doe"                                                          
 [30] "296672018097400"                                                   
 [31] "22860018097400"                                                    
 [32] "Site"                                                              
 [33] "A Site"                                                            
 [34] "Click here to enter text."                                         
 [35] "399796018097400"                                                   
 [36] "125476018097400"                                                   
 [37] "Plan View Below at:"                                               
 [38] "15cmbd"                                                            
 [39] "Digital Recorder"                                                  
 [40] "Jane Doe"                                                          
 [41] "R"                                                                 
 [42] "25 O9"                                                             
 [43] "(This Edge is North)"                                              
 [44] "R27 O10"                                                           
 [45] "Click here to enter text."                                         
 [46] "Material collected from this level:"                               
 [47] "Accession No."                                                     
 [48] "Click here"                                                        
 [49] "."                                                                 
 [50] "472948015684400"                                                   
 [51] "377761516319400"                                                   
 [52] "36068014858900"                                                    
 [53] "133985015938400"                                                   
 [54] "Cat. #"                                                            
 [55] "Click."                                                            
 [56] "Contents:"                                                         
 [57] "No artifacts collected"                                            
 [58] "Cat. #"                                                            
 [59] "Click."                                                            
 [60] "Contents:"                                                         
 [61] "Click here to enter text."                                         
 [62] "472884516001900"                                                   
 [63] "377761517271900"                                                   
 [64] "35369516192400"                                                    
 [65] "134302517335400"                                                   
 [66] "Cat. #"                                                            
 [67] "Click."                                                            
 [68] "Contents:"                                                         
 [69] "Click here to enter text."                                         
 [70] "Cat. #"                                                            
 [71] "Click."                                                            
 [72] "Contents:"                                                         
 [73] "Click here to enter text."                                         
 [74] "472948016446400"                                                   
 [75] "377761517335400"                                                   
 [76] "36449017398900"                                                    
 [77] "133985017017900"                                                   
 [78] "Cat. #"                                                            
 [79] "Click."                                                            
 [80] "Contents:"                                                         
 [81] "Click here to enter text."                                         
 [82] "Cat. #"                                                            
 [83] "Click."                                                            
 [84] "Contents:"                                                         
 [85] "Click here to enter text."                                         
 [86] "472948017462400"                                                   
 [87] "377761517335400"                                                   
 [88] "35433017335400"                                                    
 [89] "133985016954400"                                                   
 [90] "Cat. #"                                                            
 [91] "Click."                                                            
 [92] "Contents:"                                                         
 [93] "Click here to enter text."                                         
 [94] "Cat. #"                                                            
 [95] "Click."                                                            
 [96] "Contents:"                                                         
 [97] "Click here to enter text."                                         
 [98] "Excavation techniques used:"                                       
 [99] "text document testing"                                             
[100] "Nature of soil matrix:"                                            
[101] "Teesting"                                                          
[102] "test"                                                              
[103] "test"                                                              
[104] "Remarks on features, postholes, artifact content, special samples,"
[105] "etc"                                                               
[106] ":"                                                                 
[107] "test"                                                              
[108] "test"                                                              
[109] "document lorem ipsum."                                             
[110] "Photograph information:"                                           
[111] "no photos taken."

textreadr::read_docx("TestReport - Copy.docx", nodetag = '//w:t')
 [1] "University"                                                        
 [2] "Unit No."                                                          
 [3] "Z87 L0"                                                            
 [4] "Area"                                                              
 [5] "1x2 M"                                                             
 [6] "Excavation Unit Level Sheet"                                       
 [7] "Level"                                                             
 [8] "Level"                                                             
 [9] "1 ("                                                               
[10] "20"                                                                
[11] "-"                                                                 
[12] "2"                                                                 
[13] "6cmbd)"                                                            
[14] "Date"                                                              
[15] "7/12/201"                                                          
[16] "9"                                                                 
[17] "Recorder"                                                          
[18] "John Doe"                                                          
[19] "Project"                                                           
[20] "An Archaeology Project"                                            
[21] "Excavators"                                                        
[22] "John Doe"                                                          
[23] "Site"                                                              
[24] "A Site"                                                            
[25] "Click here to enter text."                                         
[26] "Plan View Below at:"                                               
[27] "15cmbd"                                                            
[28] "Digital Recorder"                                                  
[29] "Jane Doe"                                                          
[30] "R"                                                                 
[31] "25 O9"                                                             
[32] "(This Edge is North)"                                              
[33] "R27 O10"                                                           
[34] "Click here to enter text."                                         
[35] "Material collected from this level:"                               
[36] "Accession No."                                                     
[37] "Click here"                                                        
[38] "."                                                                 
[39] "Cat. #"                                                            
[40] "Click."                                                            
[41] "Contents:"                                                         
[42] "No artifacts collected"                                            
[43] "Cat. #"                                                            
[44] "Click."                                                            
[45] "Contents:"                                                         
[46] "Click here to enter text."                                         
[47] "Cat. #"                                                            
[48] "Click."                                                            
[49] "Contents:"                                                         
[50] "Click here to enter text."                                         
[51] "Cat. #"                                                            
[52] "Click."                                                            
[53] "Contents:"                                                         
[54] "Click here to enter text."                                         
[55] "Cat. #"                                                            
[56] "Click."                                                            
[57] "Contents:"                                                         
[58] "Click here to enter text."                                         
[59] "Cat. #"                                                            
[60] "Click."                                                            
[61] "Contents:"                                                         
[62] "Click here to enter text."                                         
[63] "Cat. #"                                                            
[64] "Click."                                                            
[65] "Contents:"                                                         
[66] "Click here to enter text."                                         
[67] "Cat. #"                                                            
[68] "Click."                                                            
[69] "Contents:"                                                         
[70] "Click here to enter text."                                         
[71] "Excavation techniques used:"                                       
[72] "text document testing"                                             
[73] "Nature of soil matrix:"                                            
[74] "Teesting"                                                          
[75] "test"                                                              
[76] "test"                                                              
[77] "Remarks on features, postholes, artifact content, special samples,"
[78] "etc"                                                               
[79] ":"                                                                 
[80] "test"                                                              
[81] "test"                                                              
[82] "document lorem ipsum."                                             
[83] "Photograph information:"                                           
[84] "no photos taken."

trinker commented 4 years ago

Note you're probably after nodetag = '//w:t'
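
To make the difference between the tags concrete, here is a minimal sketch using the xml2 package on a hand-written fragment rather than the real TestReport.docx. It assumes xml2 is installed. The wp:posOffset element is a hypothetical stand-in for whatever non-text element carries the numbers; the thread does not say which elements they come from, and this is not necessarily how textreadr is implemented.

```r
library(xml2)

# Hand-written WordprocessingML fragment (an assumption, not the real file):
# one paragraph holding a drawing with a numeric offset plus one text run.
xml <- paste0(
  '<w:document',
  ' xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"',
  ' xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing">',
  '<w:body><w:p>',
  '<w:r><w:drawing><wp:anchor><wp:positionH>',
  '<wp:posOffset>3317240</wp:posOffset>',
  '</wp:positionH></wp:anchor></w:drawing></w:r>',
  '<w:r><w:t>Excavation Unit Level Sheet</w:t></w:r>',
  '</w:p></w:body></w:document>'
)
doc <- read_xml(xml)

# Text of the whole paragraph: every descendant text node is concatenated,
# so the numeric offset ends up glued to the front of the visible text.
by_paragraph <- xml_text(xml_find_all(doc, "//w:p"))

# Text of the w:t nodes only: just what Word displays as text.
by_text_node <- xml_text(xml_find_all(doc, "//w:t"))

print(by_paragraph)  # "3317240Excavation Unit Level Sheet"
print(by_text_node)  # "Excavation Unit Level Sheet"
```

Selecting '//w:t' drops the numbers because they never sit inside a w:t element, at the cost of returning one element per text node instead of one per paragraph.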

EFletcher2014 commented 4 years ago

Thank you so much!!

trinker commented 4 years ago

@EFletcher2014 The solution I gave adds too much complexity to a simple reader system, so I decided to make this parsing the default. There is no longer a nodetag parameter. With the newest version, what you tried originally will work. Thanks for the issue.

EFletcher2014 commented 4 years ago

Hello again! For the past few days I've been using the old version with nodetag = '//w:t' and that was working great. Today I tried to update to the new version, but now read_docx does not return the input of my forms, regardless of whether I include a nodetag or not. For example, when running on the form example I provided earlier, this is the output:

textreadr::read_docx("TestReport.docx")
 [1] "University Unit No. Area"
 [2] "Excavation Unit Level Sheet Level Date"
 [3] "Recorder"
 [4] "Project Excavators"
 [5] "Site"
 [6] "Plan View Below at: Digital Recorder"
 [7] "(This Edge is North)"
 [8] "Material collected from this level: Accession No."
 [9] "Cat. # Contents: Cat. # Contents:"
[10] "Cat. # Contents: Cat. # Contents:"
[11] "Cat. # Contents: Cat. # Contents:"
[12] "Cat. # Contents: Cat. # Contents:"
[13] "Excavation techniques used:"
[14] "Nature of soil matrix:"
[15] "Remarks on features, postholes, artifact content, special samples, etc :"
[16] "Photograph information:"

Any idea what could be causing this?

trinker commented 4 years ago

Sorry about that. Give it a try now (v. 0.9.5):

textreadr::read_docx('TestReport.docx')

 [1] "University Unit No.   Z87 L0 Area   1x2 M"                                                            
 [2] "Excavation Unit Level Sheet Level   Date 7/12/201"                                                    
 [3] "Recorder   John Doe"                                                                                  
 [4] "Project   Excavators  John Doe"                                                                       
 [5] "Site  A Site Click here to enter text."                                                               
 [6] "Plan View Below at:  15cmbd Digital Recorder  Jane Doe"                                               
 [7] "R (This Edge is North) R27 O10"                                                                       
 [8] "Click here to enter text."                                                                            
 [9] "Material collected from this level: Accession No.  Click here"                                        
[10] "Cat. # Click. Contents:  No artifacts collected Cat. #  Click. Contents: Click here to enter text."   
[11] "Cat. # Click. Contents: Click here to enter text. Cat. #  Click. Contents:  Click here to enter text."
[12] "Cat. # Click. Contents: Click here to enter text. Cat. #  Click. Contents:  Click here to enter text."
[13] "Cat. # Click. Contents: Click here to enter text. Cat. #  Click. Contents:  Click here to enter text."
[14] "Excavation techniques used:  text document testing"                                                   
[15] "Nature of soil matrix:  Teesting"                                                                     
[16] "Remarks on features, postholes, artifact content, special samples,  etc :  test"                      
[17] "Photograph information:  no photos taken."

EFletcher2014 commented 4 years ago

Thanks for getting to this so quickly! It looks like it's still only pulling the first word in some sections, however. For example, [16] should be:

[16] Remarks on features, postholes, artifact content, special samples, etc: test test document lorem ipsum.

instead of:

[16] Remarks on features, postholes, artifact content, special samples, etc: test

trinker commented 4 years ago

Ugh... oversight:

textreadr::read_docx(system.file("docs/TestReport.docx", package = "textreadr"))
 [1] "University Unit No.   Z87 L0 Area   1x2 M"                                                                   
 [2] "Excavation Unit Level Sheet Level   Level  1 ( 20 - 2 6cmbd) Date 7/12/201 9"                                
 [3] "Recorder   John Doe"                                                                                         
 [4] "Project   An Archaeology Project Excavators  John Doe"                                                       
 [5] "Site  A Site Click here to enter text."                                                                      
 [6] "Plan View Below at:  15cmbd Digital Recorder  Jane Doe"                                                      
 [7] "R 25 O9 (This Edge is North) R27 O10"                                                                        
 [8] "Click here to enter text."                                                                                   
 [9] "Material collected from this level: Accession No.  Click here ."                                             
[10] "Cat. # Click. Contents:  No artifacts collected Cat. #  Click. Contents: Click here to enter text."          
[11] "Cat. # Click. Contents: Click here to enter text. Cat. #  Click. Contents:  Click here to enter text."       
[12] "Cat. # Click. Contents: Click here to enter text. Cat. #  Click. Contents:  Click here to enter text."       
[13] "Cat. # Click. Contents: Click here to enter text. Cat. #  Click. Contents:  Click here to enter text."       
[14] "Excavation techniques used:  text document testing"                                                          
[15] "Nature of soil matrix:  Teesting  test  test"                                                                
[16] "Remarks on features, postholes, artifact content, special samples,  etc :  test  test  document lorem ipsum."
[17] "Photograph information:  no photos taken."

trinker commented 4 years ago

Try v. 0.9.6 now.
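
For readers wondering how one line per paragraph can be produced without the stray numbers, the following is a rough sketch of one possible approach: collect only the w:t text inside each w:p and join the pieces with a space. It assumes the xml2 package is installed and uses a hand-written fragment. The thread does not show the actual textreadr implementation, and the real output above has different spacing, so treat this as an illustration only.

```r
library(xml2)

# Hand-written fragment (an assumption): two paragraphs, two text runs each.
xml <- paste0(
  '<w:document',
  ' xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">',
  '<w:body>',
  '<w:p><w:r><w:t>Recorder</w:t></w:r><w:r><w:t>John Doe</w:t></w:r></w:p>',
  '<w:p><w:r><w:t>Site</w:t></w:r><w:r><w:t>A Site</w:t></w:r></w:p>',
  '</w:body></w:document>'
)
doc <- read_xml(xml)

# One output element per paragraph: look up the w:t nodes beneath each w:p
# and collapse their text, so every run in the paragraph is kept.
paragraphs <- xml_find_all(doc, "//w:p")
para_text <- vapply(
  paragraphs,
  function(p) paste(xml_text(xml_find_all(p, ".//w:t")), collapse = " "),
  character(1)
)

print(para_text)  # "Recorder John Doe" "Site A Site"
```

Collapsing all w:t nodes, rather than taking only the first one in each run or control, is what keeps later pieces such as "test  test  document lorem ipsum." from being dropped.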

trinker commented 4 years ago

Thanks for testing and being patient

EFletcher2014 commented 4 years ago

It's working wonderfully now! Thanks again, happy to help!