omegahat / XML

The XML package for R

Unable to get all files using getHTMLExternalFiles #10

Open FinScience opened 7 years ago

FinScience commented 7 years ago

Hi,

I have an SVN repository with multiple subdirectories, and I want to get the paths of all the XML files in those subdirectories. I tried the following, but I only get the links, not the file paths. Any help would be awesome.

             library(XML)
             doc <- htmlParse(path)
             getHTMLExternalFiles(doc, xpQuery = "//a/@href", recursive = TRUE)

This gives me the following result:

           [1] "../"                           "Olympus/"                      "Reflections/"                 
           [4] "StopTimePref/"                 "http://subversion.tigris.org/"

Screenshots of the repository are attached (image, image 1, image 2, image 3).

I want to get the full path of each XML file.

duncantl commented 6 years ago

The links in the HTML documents are relative, so you have to make them absolute. You can use getRelativeURL(link, baseDoc). You know baseDoc when you are parsing the file, or you can get it from docName() on the already parsed document.
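
A minimal sketch of that suggestion, in case it helps. It is not code from the package. The directory listing and the base URL (`http://example.com/svn/repo/`) are made up for illustration, and `config.xml` is a hypothetical file name. With a document parsed from a real URL, `docName(doc)` would take the place of `base_url`.

```r
library(XML)

# Hypothetical directory listing, standing in for the page htmlParse(path) would fetch.
listing <- '<html><body><ul>
  <li><a href="../">..</a></li>
  <li><a href="Olympus/">Olympus/</a></li>
  <li><a href="config.xml">config.xml</a></li>
  <li><a href="http://subversion.tigris.org/">Subversion</a></li>
</ul></body></html>'

# Hypothetical base URL (an assumption); for a real page use docName(doc).
base_url <- "http://example.com/svn/repo/"

doc <- htmlParse(listing, asText = TRUE)

# Raw href values, exactly as written in the page (mostly relative).
links <- as.character(xpathSApply(doc, "//a/@href"))

# Drop the parent-directory entry, then resolve each link against the base URL.
links <- links[links != "../"]
abs_links <- unname(getRelativeURL(links, base_url))

# Subdirectories under the base URL, and XML files found at this level.
sub_dirs  <- abs_links[grepl("/$", abs_links) & startsWith(abs_links, base_url)]
xml_files <- abs_links[grepl("\\.xml$", abs_links, ignore.case = TRUE)]

print(sub_dirs)
print(xml_files)
```

To cover the whole repository, the same steps would be repeated for each entry in `sub_dirs`, parsing that URL with htmlParse() and using it as the new base.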