ropensci / rtika

R Interface to Apache Tika
https://docs.ropensci.org/rtika
Apache License 2.0

tika() adds an extension of .txt to all input directories #11

Closed Oneiricer closed 5 years ago

Oneiricer commented 5 years ago

Hi, I'm not sure whether I've encountered a bug or user error, but I am getting the following error:

```r
batch <- c(
  system.file("extdata", "jsonlite.pdf", package = "rtika"),
  system.file("extdata", "curl.pdf", package = "rtika"),
  system.file("extdata", "table.docx", package = "rtika"),
  system.file("extdata", "xml2.pdf", package = "rtika"),
  system.file("extdata", "R-FAQ.html", package = "rtika"),
  system.file("extdata", "calculator.jpg", package = "rtika"),
  system.file("extdata", "tika.apache.org.zip", package = "rtika")
)

text <- tika_text(batch)
```

Result:

1: In normalizePath(path.expand(path), winslash, mustWork) :
  path[1]="C:/Users/tsang/AppData/Local/Temp/RtmpY7ITct/rtika_dir2b1c6f312d39/\\air.gov.au/DFS/UserData/VIC/tsang/Documents/R/win-library/3.5/rtika/extdata/jsonlite.pdf.txt": The system cannot find the path specified
2: In normalizePath(path.expand(path), winslash, mustWork) :
  path[2]="C:/Users/tsang/AppData/Local/Temp/RtmpY7ITct/rtika_dir2b1c6f312d39/\\air.gov.au/DFS/UserData/VIC/tsang/Documents/R/win-library/3.5/rtika/extdata/curl.pdf.txt": The system cannot find the path specified
3: In normalizePath(path.expand(path), winslash, mustWork) :
  path[3]="C:/Users/tsang/AppData/Local/Temp/RtmpY7ITct/rtika_dir2b1c6f312d39/\\air.gov.au/DFS/UserData/VIC/tsang/Documents/R/win-library/3.5/rtika/extdata/table.docx.txt": The system cannot find the path specified
4: In normalizePath(path.expand(path), winslash, mustWork) :
  path[4]="C:/Users/tsang/AppData/Local/Temp/RtmpY7ITct/rtika_dir2b1c6f312d39/\\air.gov.au/DFS/UserData/VIC/tsang/Documents/R/win-library/3.5/rtika/extdata/xml2.pdf.txt": The system cannot find the path specified
5: In normalizePath(path.expand(path), winslash, mustWork) :
  path[5]="C:/Users/tsang/AppData/Local/Temp/RtmpY7ITct/rtika_dir2b1c6f312d39/\\air.gov.au/DFS/UserData/VIC/tsang/Documents/R/win-library/3.5/rtika/extdata/R-FAQ.html.txt": The system cannot find the path specified
6: In normalizePath(path.expand(path), winslash, mustWork) :
  path[6]="C:/Users/tsang/AppData/Local/Temp/RtmpY7ITct/rtika_dir2b1c6f312d39/\\air.gov.au/DFS/UserData/VIC/tsang/Documents/R/win-library/3.5/rtika/extdata/calculator.jpg.txt": The system cannot find the path specified
7: In normalizePath(path.expand(path), winslash, mustWork) :
  path[7]="C:/Users/tsang/AppData/Local/Temp/RtmpY7ITct/rtika_dir2b1c6f312d39/\\air.gov.au/DFS/UserData/VIC/tsang/Documents/R/win-library/3.5/rtika/extdata/tika.apache.org.zip.txt": The system cannot find the path specified

goodmansasha commented 5 years ago

I think the cause is that the '\\air.gov.au' prefix in the path is not being handled correctly by the 'normalizePath()' function. I'll try to reproduce this on my Windows machine and find a solution.

Oneiricer commented 5 years ago

Thanks for looking into this for me. I've tried other text mining packages (tidytext, tesseract, quanteda), and they seem to work well when given a specific location. But I think rtika works off a relative location?

e.g. this seems to work for the other 3 packages:

dest <- "H:\\R\\R scrap\\PDF Text"
myfiles <- list.files(path = dest, pattern = "pdf", full.names = TRUE)

I then feed myfiles into the respective package's import function.

goodmansasha commented 5 years ago

I realize this is frustrating, but bear with me.

You definitely should be able to list the files and send them to rtika, either as relative or absolute paths on the C drive. I suspect the issue on Windows is when the files are on other drives, like an H drive.

Could you please send me the values for the 'batch' variable above?

goodmansasha commented 5 years ago

Please send the output you get when you run this code. It will help me understand and debug the problem.

```r
batch <- c(
  system.file("extdata", "jsonlite.pdf", package = "rtika"),
  system.file("extdata", "curl.pdf", package = "rtika"),
  system.file("extdata", "table.docx", package = "rtika"),
  system.file("extdata", "xml2.pdf", package = "rtika"),
  system.file("extdata", "R-FAQ.html", package = "rtika"),
  system.file("extdata", "calculator.jpg", package = "rtika"),
  system.file("extdata", "tika.apache.org.zip", package = "rtika")
)

batch

normalizePath("/", winslash = "/")

text <- tika_text(batch, quiet = FALSE)
```

no-more-hacks commented 5 years ago

Hi, I have a similar problem running on a Windows 10 machine.

Now that I have run your example with quiet=FALSE, I think I see the issue. The crucial part is:

normalizePath("/", winslash = "/")
[1] "\\companyname.co.uk/london/"

where I've replaced my company's real name with companyname...

I think that we have some sort of roaming profile in our corporate network that is messing things up.

The full output is below.

thanks, sam

```
batch <- c(
  system.file("extdata", "jsonlite.pdf", package = "rtika"),
  system.file("extdata", "curl.pdf", package = "rtika"),
  system.file("extdata", "table.docx", package = "rtika"),
  system.file("extdata", "xml2.pdf", package = "rtika"),
  system.file("extdata", "R-FAQ.html", package = "rtika"),
  system.file("extdata", "calculator.jpg", package = "rtika"),
  system.file("extdata", "tika.apache.org.zip", package = "rtika")
)

batch
[1] "C:/source/r/R-3.5.0/library/rtika/extdata/jsonlite.pdf" "C:/source/r/R-3.5.0/library/rtika/extdata/curl.pdf"
[3] "C:/source/r/R-3.5.0/library/rtika/extdata/table.docx" "C:/source/r/R-3.5.0/library/rtika/extdata/xml2.pdf"
[5] "C:/source/r/R-3.5.0/library/rtika/extdata/R-FAQ.html" "C:/source/r/R-3.5.0/library/rtika/extdata/calculator.jpg"
[7] "C:/source/r/R-3.5.0/library/rtika/extdata/tika.apache.org.zip"

normalizePath("/", winslash = "/")
[1] "\\companyname.co.uk/london/"

text <- tika_text(batch, quiet=FALSE)
INFO about to start driver
BatchProcess:No config file set via -bc, relying on tika-app-batch-config.xml or default-tika-batch-config.xml
INFO BatchProcess: Feb 26, 2019 6:07:56 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
INFO BatchProcess: WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
INFO BatchProcess: See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
INFO BatchProcess: for optional dependencies.
INFO BatchProcess:
INFO BatchProcess: Feb 26, 2019 6:07:56 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
INFO BatchProcess: WARNING: org.xerial's sqlite-jdbc is not loaded.
INFO BatchProcess: Please provide the jar on your classpath to parse sqlite files.
INFO BatchProcess: See tika-parsers/pom.xml for the correct version.
INFO BatchProcess: randomCrawl attribute is ignored by FSListCrawler
BatchProcess:BatchProcess starting up
BatchProcess:Exception in FileResourceCrawler: Illegal char <:> at index 23: \companyname.co.uk\london\C:/source/r/R-3.5.0/library/rtika/extdata/jsonlite.pdf
BatchProcess:java.nio.file.InvalidPathException: Illegal char <:> at index 23: \companyname.co.uk\london\C:/source/r/R-3.5.0/library/rtika/extdata/jsonlite.pdf
BatchProcess: at sun.nio.fs.WindowsPathParser.normalize(Unknown Source)
BatchProcess: at sun.nio.fs.WindowsPathParser.parse(Unknown Source)
BatchProcess: at sun.nio.fs.WindowsPathParser.parse(Unknown Source)
BatchProcess: at sun.nio.fs.WindowsPath.parse(Unknown Source)
BatchProcess: at sun.nio.fs.WindowsFileSystem.getPath(Unknown Source)
BatchProcess: at java.nio.file.Paths.get(Unknown Source)
BatchProcess: at org.apache.tika.batch.fs.FSListCrawler.start(FSListCrawler.java:93)
BatchProcess: at org.apache.tika.batch.FileResourceCrawler.call(FileResourceCrawler.java:79)
BatchProcess: at org.apache.tika.batch.FileResourceCrawler.call(FileResourceCrawler.java:30)
BatchProcess: at java.util.concurrent.FutureTask.run(Unknown Source)
BatchProcess: at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
BatchProcess: at java.util.concurrent.FutureTask.run(Unknown Source)
BatchProcess: at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
BatchProcess: at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
BatchProcess: at java.lang.Thread.run(Unknown Source)
BatchProcess:Main thread in TikaFSBatchCLI has finished processing.
BatchProcess:
BatchProcess:
BatchProcess:ParallelFileProcessingResult{considered=0, added=0, consumed=0, numberHandledExceptions=0, secondsElapsed=0.021, exitStatus=0, causeForTermination='COMPLETED_NORMALLY'}
INFO The child process has finished with an exit value of: 0
INFO Process driver has completed
Warning messages:
1: In normalizePath(path.expand(path), winslash, mustWork) :
  path[1]="C:/Users/sam/AppData/Local/Temp/Rtmp4E9J9F/rtika_dir57882ab5c19/C:/source/r/R-3.5.0/library/rtika/extdata/jsonlite.pdf.txt": The filename, directory name, or volume label syntax is incorrect
2: In normalizePath(path.expand(path), winslash, mustWork) :
  path[2]="C:/Users/sam/AppData/Local/Temp/Rtmp4E9J9F/rtika_dir57882ab5c19/C:/source/r/R-3.5.0/library/rtika/extdata/curl.pdf.txt": The filename, directory name, or volume label syntax is incorrect
3: In normalizePath(path.expand(path), winslash, mustWork) :
  path[3]="C:/Users/sam/AppData/Local/Temp/Rtmp4E9J9F/rtika_dir57882ab5c19/C:/source/r/R-3.5.0/library/rtika/extdata/table.docx.txt": The filename, directory name, or volume label syntax is incorrect
4: In normalizePath(path.expand(path), winslash, mustWork) :
  path[4]="C:/Users/sam/AppData/Local/Temp/Rtmp4E9J9F/rtika_dir57882ab5c19/C:/source/r/R-3.5.0/library/rtika/extdata/xml2.pdf.txt": The filename, directory name, or volume label syntax is incorrect
5: In normalizePath(path.expand(path), winslash, mustWork) :
  path[5]="C:/Users/sam/AppData/Local/Temp/Rtmp4E9J9F/rtika_dir57882ab5c19/C:/source/r/R-3.5.0/library/rtika/extdata/R-FAQ.html.txt": The filename, directory name, or volume label syntax is incorrect
6: In normalizePath(path.expand(path), winslash, mustWork) :
  path[6]="C:/Users/sam/AppData/Local/Temp/Rtmp4E9J9F/rtika_dir57882ab5c19/C:/source/r/R-3.5.0/library/rtika/extdata/calculator.jpg.txt": The filename, directory name, or volume label syntax is incorrect
7: In normalizePath(path.expand(path), winslash, mustWork) :
  path[7]="C:/Users/sam/AppData/Local/Temp/Rtmp4E9J9F/rtika_dir57882ab5c19/C:/source/r/R-3.5.0/library/rtika/extdata/tika.apache.org.zip.txt": The filename, directory name, or volume label syntax is incorrect
```

goodmansasha commented 5 years ago

@sam-m-gardiner This info is very helpful. I'm working on it now and think I've found a solution that works across operating systems.

goodmansasha commented 5 years ago

The issue was in my understanding of how normalizePath() works with different input. I've patched the package to sidestep the issue. You can download and test it out by installing devtools and running devtools::install_github("ropensci/rtika").
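For readers following along, the symptom in the warnings above can be reproduced with base R alone. The snippet below is only a sketch of the failure mode, not the code inside rtika and not the actual patch: the two paths are copied from the warnings in this thread, and `strip_root()` is a hypothetical helper introduced purely for illustration.

```r
# Paths copied from the warnings in this thread; nothing here calls rtika.
out_dir <- "C:/Users/sam/AppData/Local/Temp/Rtmp4E9J9F/rtika_dir57882ab5c19"
input   <- "C:/source/r/R-3.5.0/library/rtika/extdata/jsonlite.pdf"

# Symptom: appending an absolute input path to the output directory leaves a
# second drive letter in the middle of the result, which Windows rejects.
naive <- paste0(file.path(out_dir, input), ".txt")
has_embedded_root <- grepl("^[A-Za-z]:/.+/[A-Za-z]:/", naive)

# Hypothetical helper (not part of rtika): drop the drive letter or leading
# slashes so that the remainder can be appended to another directory.
strip_root <- function(path) {
  path <- chartr("\\", "/", path)    # use forward slashes throughout
  sub("^([A-Za-z]:)?/+", "", path)   # remove "C:/" or a leading "//"
}

safer <- paste0(file.path(out_dir, strip_root(input)), ".txt")

print(naive)
print(has_embedded_root)
print(safer)
```

The same helper would also turn a UNC-style input such as the '\\air.gov.au/...' path from the first report into a relative fragment. Whether the package handles it this way is not stated in this thread.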

no-more-hacks commented 5 years ago

I have tried out this new version from GitHub using the test code above and it works! I am running on a corporate laptop, and it works both when connected to the network (so maybe using a remote temp folder?) and when disconnected.

Thanks for the amazing turnaround time.

Sam

I now get this output:

INFO about to start driver
BatchProcess:No config file set via -bc, relying on tika-app-batch-config.xml or default-tika-batch-config.xml
INFO BatchProcess: Feb 27, 2019 10:56:25 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
INFO BatchProcess: WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
INFO BatchProcess: See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
INFO BatchProcess: for optional dependencies.
INFO BatchProcess:
INFO BatchProcess: Feb 27, 2019 10:56:25 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
INFO BatchProcess: WARNING: org.xerial's sqlite-jdbc is not loaded.
INFO BatchProcess: Please provide the jar on your classpath to parse sqlite files.
INFO BatchProcess: See tika-parsers/pom.xml for the correct version.
INFO BatchProcess: randomCrawl attribute is ignored by FSListCrawler
BatchProcess:BatchProcess starting up
BatchProcess:Processed 0 documents in 1 second.
BatchProcess:There have been 0 handled exceptions.
BatchProcess:There are 2 file processors still active.
BatchProcess:The directory crawler has considered 7 files, and it has added 7 files.
BatchProcess:
BatchProcess:The directory crawler has completed its crawl.
BatchProcess:
BatchProcess:Processed 2 documents in 2 seconds.
BatchProcess:There have been 0 handled exceptions.
BatchProcess:There are 2 file processors still active.
BatchProcess:The directory crawler has considered 7 files, and it has added 7 files.
BatchProcess:
BatchProcess:The directory crawler has completed its crawl.
BatchProcess:
BatchProcess:No Unicode mapping for asciigrave.Var (96) in font PYEYFS+Inconsolata-zi4r
BatchProcess:Processed 4 documents in 3 seconds.
BatchProcess:There have been 0 handled exceptions.
BatchProcess:There are 2 file processors still active.
BatchProcess:The directory crawler has considered 7 files, and it has added 7 files.
BatchProcess:
BatchProcess:The directory crawler has completed its crawl.
BatchProcess:
BatchProcess:Main thread in TikaFSBatchCLI has finished processing.
BatchProcess:
BatchProcess:
BatchProcess:ParallelFileProcessingResult{considered=7, added=7, consumed=7, numberHandledExceptions=0, secondsElapsed=3.737, exitStatus=0, causeForTermination='COMPLETED_NORMALLY'}
INFO The child process has finished with an exit value of: 0
INFO Process driver has completed

goodmansasha commented 5 years ago

Glad it works. I'll close this bug.

The files Tika produces are stored in the usual place R is set up to use for temporary files.

The bug fix here was to properly reference the files that Tika produced in the temporary folder.
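To see where that temporary location is on a given machine (for example, whether it is a local folder or a network share), something like the following base R snippet can be run. It is a minimal sketch: the "rtika_dir" prefix is borrowed from the folder names in the warnings above, and the folder is created here only for illustration, so rtika's real naming scheme may differ.

```r
# Directory R uses for temporary files in this session.
tmp_root <- tempdir()

# Hypothetical output folder, named after the "rtika_dir..." folders seen in
# the warnings; rtika's real naming scheme may differ.
out_dir <- tempfile(pattern = "rtika_dir")
dir.create(out_dir)

# Resolve both paths the same way before comparing them.
same_parent <- identical(
  normalizePath(dirname(out_dir), winslash = "/"),
  normalizePath(tmp_root, winslash = "/")
)

print(normalizePath(out_dir, winslash = "/"))
print(same_parent)

unlink(out_dir, recursive = TRUE)  # clean up the example folder
```

If the printed path starts with a network prefix rather than a local drive, the temp folder is remote.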