Are the strings always the same? Can you provide an example of the doc so that this can be recreated?
On Tue, Apr 21, 2020, 1:21 PM Emily Fletcher notifications@github.com wrote:
I am using read_docx(file) to read a Word Document into text. However, read_docx is adding strings of numbers to the start of some lines of my document. For example, this line:
[image: Screenshot (51)] https://user-images.githubusercontent.com/14348637/79893966-c930a280-83d2-11ea-9b98-638c6e9412b8.png
is being read as:
331724018605500560324018605500Excavation Unit Level SheetLevel Level 1 (0-16cmbd)Date7/12/2017
Here is an example document (anonymized): TestReport.docx. I think the issue may be related to the fact that the document is a form. My corpus consists of many copies of this form. Each copy has the same numerical string added to the same line, but different lines within a document get different strings.
Gotcha. It's because the CRAN version grabs all of the text under each paragraph tag (w:p) and concatenates it. The dev version lets you swap the w:p tag for the w:r or w:t tag, which gives you better control.
Here is the output using these two tags:
textreadr::read_docx("TestReport - Copy.docx", nodetag = '//w:r')
[1] "354330021272400"
[2] "560070021272400"
[3] "University"
[4] "Unit No."
[5] "Z87 L0"
[6] "Area"
[7] "1x2 M"
[8] "331724018605500"
[9] "560324018605500"
[10] "Excavation Unit Level Sheet"
[11] "Level"
[12] "Level"
[13] "1 ("
[14] "20"
[15] "-"
[16] "2"
[17] "6cmbd)"
[18] "Date"
[19] "7/12/201"
[20] "9"
[21] "355155517017900"
[22] "Recorder"
[23] "John Doe"
[24] "363664515938500"
[25] "45720017017900"
[26] "Project"
[27] "An Archaeology Project"
[28] "Excavators"
[29] "John Doe"
[30] "296672018097400"
[31] "22860018097400"
[32] "Site"
[33] "A Site"
[34] "Click here to enter text."
[35] "399796018097400"
[36] "125476018097400"
[37] "Plan View Below at:"
[38] "15cmbd"
[39] "Digital Recorder"
[40] "Jane Doe"
[41] "R"
[42] "25 O9"
[43] "(This Edge is North)"
[44] "R27 O10"
[45] "Click here to enter text."
[46] "Material collected from this level:"
[47] "Accession No."
[48] "Click here"
[49] "."
[50] "472948015684400"
[51] "377761516319400"
[52] "36068014858900"
[53] "133985015938400"
[54] "Cat. #"
[55] "Click."
[56] "Contents:"
[57] "No artifacts collected"
[58] "Cat. #"
[59] "Click."
[60] "Contents:"
[61] "Click here to enter text."
[62] "472884516001900"
[63] "377761517271900"
[64] "35369516192400"
[65] "134302517335400"
[66] "Cat. #"
[67] "Click."
[68] "Contents:"
[69] "Click here to enter text."
[70] "Cat. #"
[71] "Click."
[72] "Contents:"
[73] "Click here to enter text."
[74] "472948016446400"
[75] "377761517335400"
[76] "36449017398900"
[77] "133985017017900"
[78] "Cat. #"
[79] "Click."
[80] "Contents:"
[81] "Click here to enter text."
[82] "Cat. #"
[83] "Click."
[84] "Contents:"
[85] "Click here to enter text."
[86] "472948017462400"
[87] "377761517335400"
[88] "35433017335400"
[89] "133985016954400"
[90] "Cat. #"
[91] "Click."
[92] "Contents:"
[93] "Click here to enter text."
[94] "Cat. #"
[95] "Click."
[96] "Contents:"
[97] "Click here to enter text."
[98] "Excavation techniques used:"
[99] "text document testing"
[100] "Nature of soil matrix:"
[101] "Teesting"
[102] "test"
[103] "test"
[104] "Remarks on features, postholes, artifact content, special samples,"
[105] "etc"
[106] ":"
[107] "test"
[108] "test"
[109] "document lorem ipsum."
[110] "Photograph information:"
[111] "no photos taken."
textreadr::read_docx("TestReport - Copy.docx", nodetag = '//w:t')
[1] "University"
[2] "Unit No."
[3] "Z87 L0"
[4] "Area"
[5] "1x2 M"
[6] "Excavation Unit Level Sheet"
[7] "Level"
[8] "Level"
[9] "1 ("
[10] "20"
[11] "-"
[12] "2"
[13] "6cmbd)"
[14] "Date"
[15] "7/12/201"
[16] "9"
[17] "Recorder"
[18] "John Doe"
[19] "Project"
[20] "An Archaeology Project"
[21] "Excavators"
[22] "John Doe"
[23] "Site"
[24] "A Site"
[25] "Click here to enter text."
[26] "Plan View Below at:"
[27] "15cmbd"
[28] "Digital Recorder"
[29] "Jane Doe"
[30] "R"
[31] "25 O9"
[32] "(This Edge is North)"
[33] "R27 O10"
[34] "Click here to enter text."
[35] "Material collected from this level:"
[36] "Accession No."
[37] "Click here"
[38] "."
[39] "Cat. #"
[40] "Click."
[41] "Contents:"
[42] "No artifacts collected"
[43] "Cat. #"
[44] "Click."
[45] "Contents:"
[46] "Click here to enter text."
[47] "Cat. #"
[48] "Click."
[49] "Contents:"
[50] "Click here to enter text."
[51] "Cat. #"
[52] "Click."
[53] "Contents:"
[54] "Click here to enter text."
[55] "Cat. #"
[56] "Click."
[57] "Contents:"
[58] "Click here to enter text."
[59] "Cat. #"
[60] "Click."
[61] "Contents:"
[62] "Click here to enter text."
[63] "Cat. #"
[64] "Click."
[65] "Contents:"
[66] "Click here to enter text."
[67] "Cat. #"
[68] "Click."
[69] "Contents:"
[70] "Click here to enter text."
[71] "Excavation techniques used:"
[72] "text document testing"
[73] "Nature of soil matrix:"
[74] "Teesting"
[75] "test"
[76] "test"
[77] "Remarks on features, postholes, artifact content, special samples,"
[78] "etc"
[79] ":"
[80] "test"
[81] "test"
[82] "document lorem ipsum."
[83] "Photograph information:"
[84] "no photos taken."
Note you're probably after nodetag = '//w:t'
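To illustrate why the choice of node matters, here is a minimal sketch using the xml2 package on a made-up fragment of WordprocessingML. It is not textreadr's actual implementation. The fragment, the `<offset>` element (a stand-in for any non-text node inside a paragraph that happens to hold numbers), and all variable names are assumptions for illustration only.

```r
library(xml2)

# Hypothetical, heavily simplified stand-in for a docx's word/document.xml.
# <offset> is a made-up placeholder for non-text content nested in a run.
doc_xml <- paste0(
  '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">',
  '<w:body>',
  '<w:p>',
  '<w:r><w:drawing><offset>3317240</offset></w:drawing></w:r>',
  '<w:r><w:t>Excavation Unit Level Sheet</w:t></w:r>',
  '<w:r><w:t>Level 1</w:t></w:r>',
  '</w:p>',
  '<w:p>',
  '<w:r><w:t>Recorder</w:t></w:r>',
  '<w:r><w:t>John Doe</w:t></w:r>',
  '</w:p>',
  '</w:body>',
  '</w:document>'
)
doc <- read_xml(doc_xml)
paragraphs <- xml_find_all(doc, "//w:p")

# Taking all text under each paragraph also picks up the non-text node,
# and the runs are glued together without separators.
by_paragraph <- xml_text(paragraphs)
# "3317240Excavation Unit Level SheetLevel 1" "RecorderJohn Doe"

# Selecting only the text nodes drops the stray numbers,
# but yields one element per text node rather than per paragraph.
by_text_node <- xml_text(xml_find_all(doc, "//w:t"))
# "Excavation Unit Level Sheet" "Level 1" "Recorder" "John Doe"

# Collecting the w:t nodes within each paragraph keeps one element per paragraph.
by_paragraph_clean <- vapply(
  paragraphs,
  function(p) paste(xml_text(xml_find_all(p, ".//w:t")), collapse = " "),
  character(1)
)
# "Excavation Unit Level Sheet Level 1" "Recorder John Doe"
```

The last variant is only one possible way to get paragraph-level output without the numeric strings; whether it matches what the package does internally is not something this sketch claims.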
Thank you so much!!
@EFletcher2014 The solution I gave adds too much complexity to a simple reader system, so I decided to make this parsing the default. There is no longer a nodetag parameter. With the newest version, what you tried originally will work. Thanks for the issue.
Hello again! For the past few days I've been using the old version with nodetag = '//w:t' and that was working great. Today I tried to update to the new version, but now read_docx does not return the input of my forms, regardless of whether I include a nodetag or not. For example, when run on the form example I provided earlier, this is the output:
textreadr::read_docx("TestReport.docx")
[1] "University Unit No. Area"
[2] "Excavation Unit Level Sheet Level Date"
[3] "Recorder"
[4] "Project Excavators"
[5] "Site"
[6] "Plan View Below at: Digital Recorder"
[7] "(This Edge is North)"
[8] "Material collected from this level: Accession No."
[9] "Cat. # Contents: Cat. # Contents:"
[10] "Cat. # Contents: Cat. # Contents:"
[11] "Cat. # Contents: Cat. # Contents:"
[12] "Cat. # Contents: Cat. # Contents:"
[13] "Excavation techniques used:"
[14] "Nature of soil matrix:"
[15] "Remarks on features, postholes, artifact content, special samples, etc :"
[16] "Photograph information:"
Any idea what could be causing this?
Sorry about that. Give it a try now (v. 0.9.5):
textreadr::read_docx('TestReport.docx')
[1] "University Unit No. Z87 L0 Area 1x2 M"
[2] "Excavation Unit Level Sheet Level Date 7/12/201"
[3] "Recorder John Doe"
[4] "Project Excavators John Doe"
[5] "Site A Site Click here to enter text."
[6] "Plan View Below at: 15cmbd Digital Recorder Jane Doe"
[7] "R (This Edge is North) R27 O10"
[8] "Click here to enter text."
[9] "Material collected from this level: Accession No. Click here"
[10] "Cat. # Click. Contents: No artifacts collected Cat. # Click. Contents: Click here to enter text."
[11] "Cat. # Click. Contents: Click here to enter text. Cat. # Click. Contents: Click here to enter text."
[12] "Cat. # Click. Contents: Click here to enter text. Cat. # Click. Contents: Click here to enter text."
[13] "Cat. # Click. Contents: Click here to enter text. Cat. # Click. Contents: Click here to enter text."
[14] "Excavation techniques used: text document testing"
[15] "Nature of soil matrix: Teesting"
[16] "Remarks on features, postholes, artifact content, special samples, etc : test"
[17] "Photograph information: no photos taken."
Thanks for getting to this so quickly! It looks like it's still only pulling the first word in some sections, however. For example, [16] should be:
[16] Remarks on features, postholes, artifact content, special samples, etc: test test document lorem ipsum.
instead of:
[16] Remarks on features, postholes, artifact content, special samples, etc: test
Ugh... oversight:
textreadr::read_docx(system.file("docs/TestReport.docx", package = "textreadr"))
[1] "University Unit No. Z87 L0 Area 1x2 M"
[2] "Excavation Unit Level Sheet Level Level 1 ( 20 - 2 6cmbd) Date 7/12/201 9"
[3] "Recorder John Doe"
[4] "Project An Archaeology Project Excavators John Doe"
[5] "Site A Site Click here to enter text."
[6] "Plan View Below at: 15cmbd Digital Recorder Jane Doe"
[7] "R 25 O9 (This Edge is North) R27 O10"
[8] "Click here to enter text."
[9] "Material collected from this level: Accession No. Click here ."
[10] "Cat. # Click. Contents: No artifacts collected Cat. # Click. Contents: Click here to enter text."
[11] "Cat. # Click. Contents: Click here to enter text. Cat. # Click. Contents: Click here to enter text."
[12] "Cat. # Click. Contents: Click here to enter text. Cat. # Click. Contents: Click here to enter text."
[13] "Cat. # Click. Contents: Click here to enter text. Cat. # Click. Contents: Click here to enter text."
[14] "Excavation techniques used: text document testing"
[15] "Nature of soil matrix: Teesting test test"
[16] "Remarks on features, postholes, artifact content, special samples, etc : test test document lorem ipsum."
[17] "Photograph information: no photos taken."
try v 0.9.6 now
Thanks for testing and being patient
It's working wonderfully now! Thanks again, happy to help!