Hi, thanks for the feedback :) you are a mind reader - this is something I have wanted to implement ever since I saw a plugin for Atom that does it, and I was thinking earlier about how it could be accomplished. However, in my opinion, the Atom version misses a trick by not moving the selection cursor to the node matching the xpath query expression when you click on it. If you are happy to give it a go, please feel free - I think it would nicely complete the plugin :)
Forgot to mention that the difficulty I foresee in Python is that the xml parsers don't seem to retain/surface line and column (or even character) position information...
Yea, you're right. I've only done a short search online, and will continue to look for viable (hopefully simple) solutions, but the only ones I've seen require rolling your own function that digs into the parser's internals. However, I did see that you can get back the line number using lxml (http://stackoverflow.com/questions/6949395/is-there-a-way-to-get-a-line-number-from-an-elementtree-element), but I'm thinking the column number would also be important (i.e. for multiple tags on a single line).
Nice find, looks like there are some useful answers there :) but as you say, column number still seems non-trivial. I've seen that a sax parser might help (http://stackoverflow.com/questions/15477363/xml-sax-parser-and-line-numbers-etc), but I am not terribly keen on using external libraries tbh - I think the source code for them would have to be manually included in our repo (due to the way Sublime's Python architecture is implemented), which I don't want to do, as it would be a nightmare to maintain when the library is updated etc. I've created a new branch in my fork with an initial proof of concept, with some comments explaining ideas I have had on how we could solve this problem :)
Do you mean this to say that lxml would not be a good choice due to Sublime's underlying Python implementation?
If so, is there any chance of using pip with the plugin install?
It sure would be nice to have an XPath 1.0 compliant query engine. It seems that ElementTree only supports a small subset of the XPath spec.
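For example (a small, made-up illustration of the gap - the sample XML here is mine, not from the plugin):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<cars><car year="2004"/><car year="2008"/></cars>')

# Supported by ElementTree's limited subset: location paths and attribute tests.
print(root.findall('.//car[@year="2008"]'))

# Not supported: general XPath 1.0 functions like count() or number();
# the line below raises SyntaxError in ElementTree, but works in a full engine.
# root.findall('.//car[count(.)>0]')
```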
It seems that we might be able to use lxml quite easily after all, by setting it as a dependency for Package Control to pick up :+1: https://packagecontrol.io/docs/dependencies
EDIT: I tried, but unfortunately it doesn't work. I created a `dependencies.json` with the following contents:

```json
{
    "*": {
        "*": [
            "lxml"
        ]
    }
}
```

Restarted Sublime Text and got an error in the console:

```
Package Control: Installing 1 missing dependency
Package Control: The dependency "lxml" is either not available on this platform or for this version of Sublime Text
```
I have asked the dependency maintainer to include the Windows x64 version, which will fix the problem in my comment above and allow us to easily reference lxml :)
Oh wow, that is awesome. Just curious, but will the fact that it uses libxml2 binaries be an issue? I'm guessing this is why you are asking for just Windows x64. I know that libxml2 will work on Mac OS as well, and likely Linux (haven't tried though).
If that is not an issue, then this is great news. I use basic XPath quite often at my job, so your plugin already has tremendous value, but I also use XPath for web scraping, where the expressions can get quite complex, so having the ability to use an XPath-compliant query engine against XML/HTML sources will make this plugin indispensable for me, and I'm sure many others.
Also, if there are no issues with using lxml in the plugin, I know of a way we can use its XPath engine to find the line/column number of the selected nodes. Essentially we would use your new commit that parses the XML and adds tag/line information to the elements, and then when you run the XPath through lxml it will give you back the selected elements. From there we can iterate through the elements and call lxml's `getpath`, which will return the XPath of each element in node-hierarchy form (like your plugin currently does). Then it's just a matter of evaluating that hierarchy against the regular Python XPath engine and referencing the found nodes against the nodes in the parse tree with line/column information.
I'm not sure if this makes sense. I have to head to work now, so I'm kind of rushing. But essentially this works around the line/column limitation in lxml.
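Roughly, something like this - just a rough sketch, assuming lxml is available and the expression returns elements; lxml's `sourceline` stands in here for the line/column information the annotated tree would provide:

```python
from lxml import etree

def query_with_positions(xml_bytes, expression):
    root = etree.fromstring(xml_bytes)
    tree = root.getroottree()
    results = []
    for element in root.xpath(expression):
        # getpath() gives the node-hierarchy form of the element's location,
        # which can then be matched against the position-annotated parse tree.
        results.append((tree.getpath(element), element.sourceline))
    return results

# e.g. query_with_positions(b'<r><a/><a/></r>', '//a')
# -> [('/r/a[1]', 1), ('/r/a[2]', 1)]
```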
Yes, you are right about libxml2 binaries - the dependency package already included them for linux and mac :) it was just missing windows, but this binary has now been added successfully :) I wouldn't want to exclude any potential users by restricting what platform the plugin works on :)
I like your solution, very clever ;) I do wonder about performance though, parsing the xml twice and generating and looking up xpaths... I was also considering that maybe line number is enough without column, because I would imagine most users would manually run a pretty print command before working with the xml anyway... ;) Maybe I will be able to find an ideal solution, I haven't given up yet :)
Yea, I dislike the idea of having to parse a document twice just to evaluate the xpath expressions, but I couldn't think of a good way to get around the line/column issue. But I do agree that the source line might be enough information; if we force pretty-print on the xml input we might be able to get around that issue. The only problem I foresee is that pretty-print seems to have a difficult time with some HTML, so the line number may be difficult to depend on.
I do plan on working on the code to try out some of the techniques discussed, but this is the first time I've worked with the sublime plugin API so there is a knowledge gap I have to overcome. I hope to spend some time this week getting more familiar.
Also one thing that just came to mind is that although we may have to parse the Xml twice initially, from that point forward we only need to re-parse the Xml when the document changes. So for this particular case the above solution should be pretty fast.
Also, how do you see this plugin functioning? Will it be some kind of live query like Sublime does for its "Find" functionality, or will the user be required to explicitly execute the XPath query? In the latter case, I think the double-parse becomes even less critical, but I could definitely understand why a live search would be ideal (actually, one of my favorite Sublime features). We could also use some kind of option to trigger one or the other, so that users querying larger Xml/Html documents don't take as much of a performance hit.
I believe I have achieved it with only parsing the xml once! :) Wow, that was not easy! If you can find a nicer/better/cleaner way, please do!
I have built it like you suggested, with a preference whether or not to use live search. It still needs a lot of work, for example your idea to improve it by only (re-)parsing the xml when the document changes etc. And I think adding a (configurable, small) delay to the live search would be a good idea, so that if you type `/example`, for example, it won't try to perform the query on `/`, `/e`, `/ex` etc. as you are typing, but wait for a pause in input. I think this should also have the added benefit of being more responsive, because the actual query could run on a different thread to the user input.
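A minimal sketch of what that delay could look like with the Sublime API - `run_xpath_query` is a hypothetical stand-in for the plugin's actual query code:

```python
import sublime

class QueryDebouncer:
    """Run a query only after input has paused for delay_ms milliseconds."""

    def __init__(self, delay_ms=250):
        self.delay_ms = delay_ms
        self.sequence = 0

    def on_input_changed(self, query):
        self.sequence += 1
        token = self.sequence
        # set_timeout_async runs the callback on a background thread,
        # keeping the input panel responsive while the query executes.
        sublime.set_timeout_async(lambda: self.run_if_latest(token, query),
                                  self.delay_ms)

    def run_if_latest(self, token, query):
        if token == self.sequence:  # no newer keystroke arrived during the delay
            run_xpath_query(query)  # hypothetical: perform the query and show results
```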
This is some awesome thought and work you guys have put into this. I'm excited to see what comes out of it!
I'm getting close to a version that I'm happy with on my `query` branch :)
Would you like to help out with testing my changes?
So far I have found one side-effect that I am not sure how to easily fix. Because it is using an XML parser, HTML documents that have embedded `<script>` tags often throw a parse error where `<` and `>` operators are used, because the document isn't valid XML.
Hey, sorry for the absence - your workaround is very clever. Last I checked, you are using the namespace to store the location information as you parse the doc? Very cool!
Now, as to your script issue. Do you think it might be possible to regex for the script open/close tags before you begin parsing, and then wrap the script code in a CDATA section so that the parser overlooks it? Then as you parse, if you run into script begin/end tags, update an offset to the line/col information to account for the additional characters. I know this is not an ideal workaround, because now you're having to keep track of an offset for the line/col, which could get messy, but I think with your current approach this might work out well. What do you think?
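Something along these lines, maybe - just a rough sketch, and it deliberately ignores the offset bookkeeping I mentioned:

```python
import re

# Matches a <script> element and captures its open tag, body and close tag.
SCRIPT_RE = re.compile(r'(<script\b[^>]*>)(.*?)(</script>)',
                       re.IGNORECASE | re.DOTALL)

def wrap_scripts_in_cdata(html):
    def repl(match):
        open_tag, body, close_tag = match.groups()
        if body.lstrip().startswith('<![CDATA['):
            return match.group(0)  # already wrapped; leave untouched
        # The CDATA section hides bare < and > operators from the XML parser.
        return open_tag + '<![CDATA[' + body + ']]>' + close_tag
    return SCRIPT_RE.sub(repl, html)
```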
And of course, I would be happy to test. I'll spend time this weekend messing around with it, and let you know how it goes. If you have any particular cases or data you want me to use, let me know.
That's right, I was storing the open tag start position in a namespace, well spotted :) I have since changed it to store tag positions in attributes with a specific namespace instead, as I wanted to store more than just open tag start position. (and I could only store namespace information before the element is initially created)
Your regex idea should work, thanks for that. I think that first I will see if lxml has a sax html parser I can use, it should be a lot less work if it does, and have the advantage of keeping the code easier to understand ;)
I have mainly tested with XPath expressions that return an element nodeset. I have also tested the count function, but that is about it. So if you are able to test something returning a string or boolean, it would be awesome! Am I right in thinking that in XPath 1.0, it is not possible to return more than one item unless the result is a nodeset? For example, you can't craft an expression that returns multiple numeric values? And returning multiple text strings still counts as a nodeset containing text nodes, where you can still look up parent element information? My understanding is that lxml's ElementTree doesn't distinguish between node types anyway, so I don't have to worry about it too much... If I am wrong about any of these beliefs, some test cases to work with would be amazing! :+1:
Thanks for your time and assistance ;)
> Am I right in thinking that in XPath 1.0, it is not possible to return more than one item except if the result is a nodeset?

I believe you are correct that XPath 1.0 queries only return nodesets. However, there is a range of things it classifies as nodes. According to the Data Model section of the XPath 1.0 specification, there are seven types of nodes: root, element, text, attribute, namespace, processing instruction, and comment nodes.
If you are already aware of this information, then forgive me for regurgitating, I just wanted to make sure I was answering your question completely.
> Am I right in thinking that in XPath 1.0, it is not possible to return more than one item except if the result is a nodeset? For example, you can't craft an expression that could return multiple numeric values?
If I'm interpreting correctly, you are asking whether it is possible to return a numeric type, and from the reference above I believe you are correct in your reasoning. You should note that it is possible to return numeric values, booleans, strings, and nodesets from functions (not many return the latter), but XPath will only query nodesets. For example, this XPath would be invalid:

`//table/count(tr)`

This would work:

`//table[count(tr)>2]`
> And returning multiple text strings still counts as a nodeset containing text nodes, where you can still lookup parent element information?

That is correct.
So I've been using this from time to time in my day-to-day work, and it's working really well. I don't do much namespace-aware querying, so I haven't gotten around to testing that functionality much, but for XML-compliant documents everything looks solid. I've still been banging my head trying to figure out a way to use lxml to provide line/column information in one pass, but I haven't had much success. The fact that you were able to create a solution that does this with one pass is amazing! I imagine it has not been easy to code.
As far as error testing, there are a few things I wanted to bring to your attention.
Datasets I used:

- ebay.xml.txt
- mondial-3.0.xml.txt
- football.html.txt (taken from http://www.pro-football-reference.com/boxscores/201409280pit.htm)
- football-clean.html.txt
- craigslist.html.txt (taken from https://dallas.craigslist.org/search/sya)
- craigslist-clean.html.txt
All of the XML test cases worked correctly.

ebay:

`//listing/seller_info[number(seller_rating)>400]/seller_name`
`//listing/item_info[number(normalize-space(substring-before(hard_drive,'GB')))>30]/description`

mondial-3.0:

`//country[@name='Cameroon']//province/city[number(population)>15000]/name` (just a photo showing it working)
`//country[number(@infant_mortality)<5]/name`
`//country[number(@infant_mortality)<5]/name/text()`
`//country/@name`
Now for the fun stuff... The HTML gave a little trouble, but I think most of the issues can be solved with a bit of HTML tidying. The first issue occurs with malformed HTML; take a look at football.html and craigslist.html for examples. When you open these in the plugin it errors when parsing the XML, however this can be solved using Beautiful Soup's prettify. It's certainly not an ideal solution, but if you can somehow notify the user that the parser has broken because the HTML may be malformed, and give them the option of transforming it, this might be viable. It also turns out that in addition to the malformed HTML, the parser also does not like certain HTML entities. I think the valid XML entities are `['quot', 'amp', 'apos', 'lt', 'gt']`, and it throws an error on everything else. So I created the small script shown below, which runs the Beautiful Soup parser and uses a regex to place a CDATA tag around the invalid entities. This seemed to work well enough on these two very badly formed HTML sample sets. I also verified a few of the XPaths against the lxml parser and received the same results, so I don't think there would be an issue where an XPath works in your plugin but fails or returns a different result set using lxml.
```python
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)
import re

# Entities that are predefined in XML; anything else breaks the XML parser.
valid_entities = ['quot', 'amp', 'apos', 'lt', 'gt']

def clean(s):
    # Let Beautiful Soup repair the malformed HTML first.
    soup = BeautifulSoup(s)
    result = soup.prettify()
    # Wrap any non-XML entity in a CDATA section so the XML parser ignores it.
    result = re.sub(r'&(\w+);',
                    lambda m: m.group(0)
                              if m.group(1) in valid_entities
                              else '<![CDATA[' + m.group(0) + ']]>',
                    result)
    return result
```
craigslist-clean:

`//p[number(substring-after(a/span[@class="price"], "$"))<300]//span[@class="pl"]/a`
`//p[.//span[@class="pnr"]//text()[contains(.,"Waxahachie")]]//span[@class="pl"]/a`

football-clean:

`//a[@name="snap_counts"]/following::div[.//div[@class="table_heading"]//text()[contains(translate(.,"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz","ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ"),"STEELERS")]][1]//table/tbody/tr`
`//table[@id="pbp_data"]/tbody/tr[not(contains(@class,"thead"))]`
A few additions (maybe not for the initial release, but something to think about if it wouldn't be too difficult to add):
query: `//Car[position()>1]`

```xml
<Car id="1">
    <Make>Ford</Make>
    <Model>Mustang</Model>
    <Year>2004</Year>
</Car>
<Car id="2">
    <Make>Ford</Make>
    <Model>Mustang</Model>
    <Year>2008</Year>
</Car>
<Car id="3">
    <Make>Ford</Make>
    <Model>Mustang</Model>
    <Year>2010</Year>
</Car>
```
I would be able to copy the hierarchy of each node, resulting in:
<Car id="2">
<Make>Ford</Make>
<Model>Mustang</Make>
<Year>2008</Year>
</Car>
<Car id="3">
<Make>Ford</Make>
<Model>Mustang</Make>
<Year>20010</Year>
</Car>
I'm thinking the last one might be a bit challenging, and could perhaps be added to a later release. But thought it would be useful for extracting and viewing just the particular information from the XPath query.
Again, great job man, and thanks for the contribution. You've made this tool absolutely priceless for my future Xml/Html tinkering, and I'm sure many others will feel the same. Very well done!
Hi @BrutalSimplicity, thanks very much for the awesome test cases - you've clearly put a lot of time, effort and thought into it, I really appreciate it :+1: you are far more knowledgeable at XPath than me, and I value your expertise ;)
I have fixed the issue whereby a query returning a nodeset containing text nodes or attribute nodes wasn't showing the results. Please see my latest commits :) Note that it's not perfect with how it navigates to the node - there is definitely room for improvement there, but it displays it correctly ;)
Thanks for the html tidying code, it looks great - I will experiment with that when I get chance :)
I like your idea about storing query history, I would love that too. Unfortunately I'm not sure the best way to achieve this in terms of the UI, Sublime's find and replace dialog has a dropdown list where you can browse the history, but from the plugin, we are restricted to a simple textbox...
There is a way to select all results from the query, from which you can then use the normal "copy xpaths at cursors to clipboard" functionality, but it's not exactly accessible: you can set the preference `show_query_results` to `false`, and it will skip showing you the results and just select all the relevant nodes. I need to think about a nicer way to expose this functionality - any ideas would be very welcome :) Again, due to UI limitations in the plugin API, I can't just add a nice button or something by the textbox.
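For reference, that would look something like this in your user settings - the exact settings file name is an assumption here, check the plugin's readme for the real one:

```json
{
    // skip the results panel and select all matched nodes directly
    "show_query_results": false
}
```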
As for copying the element hierarchy of all selected nodes, I can see that it would be useful to quickly extract the relevant nodes from the rest of the document - excellent idea! :) The above answer applies again about how to expose the functionality while querying the document. However, in the meantime, I will have a go at adding it separately, so that it is not tied into the query functionality, just relying on the multiple selections :) (my solution will be to create an entry in the Command Palette for it)
Thanks once again!
I have discovered that a nodeset can contain multiple node types, so we have a new test case, even though it may not be that useful in a real-life situation:

mondial-3.0:

`/mondial/country/name|/mondial/country/@id`
I have added a command in the Command Palette called "Select entire element", which, in combination with the `"show_query_results": false` preference mentioned previously, will allow you to select the element hierarchy after a successful xpath query, so that you can easily copy it to the clipboard.
Regarding cleaning HTML, I have hit a bit of a roadblock: the BeautifulSoup module isn't included with Sublime Text, and nobody has created a package/dependency for it on packagecontrol.io yet. The source only seems to come in Python 2 flavor by default, and I've never used this module before, so I am reluctant to be the one in charge of maintaining it for Sublime use... Any volunteers? Or do you think it would be best to not include @BrutalSimplicity's idea to

> notify the user that the parser has broken because the html may be malformed and give them the option of transforming it

as part of this plugin, and leave it as a task for the user or a separate plugin that is dedicated to only cleaning html? There are some that exist, but not in Package Control - for example an old one for ST2...
Here is an example showing the "select entire element" working :)
I found a solution to my query history problem - I've added a new command to the Command Palette which will show the query history. When you select an entry, it will show the xpath query input with this query prefilled. Currently, it only shows queries used in the current document and forgets them when you close the document. What do you think, is that alright? I have tested, and it even remembers history if you restart Sublime :)
And to make it easier to select all results from a query without having to change any preferences, I have added a command to re-run the last query and select all results. This means you can cycle through the query history, run the query and see the results, and then quickly re-run it and select the results. :+1:
I like the query history concept. Based on what you said about remembering the history for a document, that seems to imply that the history is not shared across documents. I think it might be more useful to have a shared history across all files, even if it is not persisted across sublime launches or system reboots. The advantages of this can be well-understood in a situation where you have several documents with the same structure, but different data.
I also noticed in your gif that you have mixed tabs/spaces. Is that something this plugin is screwing up, or was it just a copy/paste thing? If it's something from this plugin we should fix that.
The mixed tabs and spaces was a copy paste thing and I was too lazy to convert it, despite it being such a quick action in Sublime lol - well noticed Ross :) the plugin doesn't ever modify the document, only where the selection is :) though this will change once we add the prompt to clean html documents with BeautifulSoup :)
Good point about query history. I think I will make it an option whether the history is global or per file, and we can default it to global. I'm all about choice, and I can see some cases where people might find it annoying ;) I prefer too many options to not enough - hope that is alright by you?
Yeah it's fine. Just because I don't see the need doesn't mean others don't. And I like options, too.
I have now created the option to store the xpath query history globally or per file :)
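For anyone curious, a rough sketch of how such an option could work - the setting and key names here are illustrative, not necessarily what the plugin actually uses:

```python
import sublime

global_history = []  # shared across all views for the "global" option

def remember_query(view, query):
    settings = sublime.load_settings('xpath.sublime-settings')  # name assumed
    if settings.get('global_query_history', True):
        if query not in global_history:
            global_history.append(query)
    else:
        # Per-view settings are saved with the session, so per-file history
        # can survive a restart of Sublime Text.
        history = view.settings().get('xpath_query_history', [])
        if query not in history:
            history.append(query)
            view.settings().set('xpath_query_history', history)
```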
Good news! It turns out that lxml has a html parser, and it can clean the document for us without us requiring BeautifulSoup as a dependency :) I have now implemented it the way you suggested @BrutalSimplicity: if it fails to parse the HTML, it notifies the user and gives them the option of transforming it :) Please could you do some more testing? :D
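For the curious, a minimal sketch of the lenient parsing - not the plugin's actual code, just the general lxml mechanism:

```python
from lxml import etree

def clean_html(text):
    # recover=True lets libxml2 repair malformed markup instead of failing,
    # so bare < and > in scripts or unknown entities no longer abort parsing.
    parser = etree.HTMLParser(recover=True)
    root = etree.fromstring(text, parser=parser)
    # Serialise back out as a well-formed document the XPath engine can query.
    return etree.tostring(root)
```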
To make testing easier, I have created a release in my fork, which you can copy into Package Control's `Installed Packages` folder, overwriting the existing file. On Windows, you can get there from Sublime Text by going to the Preferences menu -> Browse Packages, then up a level.
I've included a few functions from the XPath 2.0 specification, which might make our lives easier ;) namely `upper-case`, `lower-case`, `ends-with`, `matches` and `tokenize`. I have also included `print`, to make xpath expressions a bit easier to debug by logging to the console. Maybe I should put `print` in an `st3` namespace prefix or something though, to make it clear that it isn't standard - what do you think?
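In case it helps anyone reading along, lxml lets you register custom functions in an XPath function namespace, which is presumably the mechanism behind extensions like these - a minimal sketch, where the namespace URI is made up:

```python
from lxml import etree

# Register a function under a namespace URI; binding the URI to an "st3"
# prefix at query time makes the non-standard origin explicit.
NS_URI = 'http://example.com/st3'  # made-up URI for illustration
ns = etree.FunctionNamespace(NS_URI)

def xpath_print(context, value):
    print(value)  # log to the console for debugging
    return value

ns['print'] = xpath_print

root = etree.fromstring('<a><b>hello</b></a>')
root.xpath('st3:print(string(/a/b))', namespaces={'st3': NS_URI})  # prints 'hello'
```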
Now has syntax coloring for xpath expressions :) it's not perfect, and may not play well with all color themes, but should make it more fun to play with XPath queries :) [image: xpath syntax color] https://cloud.githubusercontent.com/assets/11882719/11892941/3348e4d4-a575-11e5-9e50-657b759bc3f6.png
That's incredible! Nice work!
Thanks Ross - I'm really proud of it :) in my opinion, it will be ready for release soon :)
This plugin helped me to answer this on stackoverflow :)
I've added auto-completion when entering XPath expressions, which triggers automatically when typing a `/` or a `[` character. So far, it is very simple, and just shows static entries, namely axis specifiers and the text node type. But I am hoping to be able to extend it to automatically suggest available attributes on a node when typing `@`, and children tag names etc :)
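Roughly speaking, the Sublime API makes this possible via an event listener - a simplified sketch, where the `source.xpath` scope name is an assumption rather than the plugin's actual scope:

```python
import sublime_plugin

AXIS_SPECIFIERS = ['ancestor::', 'ancestor-or-self::', 'child::',
                   'descendant::', 'following-sibling::', 'parent::',
                   'preceding-sibling::', 'self::', 'text()']

class XPathCompletionsListener(sublime_plugin.EventListener):
    def on_query_completions(self, view, prefix, locations):
        # Only offer completions while editing an xpath expression.
        if not view.match_selector(locations[0], 'source.xpath'):
            return None
        # Static entries for now: each tuple is (trigger\thint, contents).
        return [(axis + '\taxis', axis) for axis in AXIS_SPECIFIERS]
```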
:+1: :+1: :+1: :+1: Very cool
aaand done :) demonstration:
@keith-hall This is incredible. When you make a pull request please also add docs for it in the README. It would be cool if you added these gifs, too. Or at least the last one, which has the finished functionality.
This is insanely awesome. Bravo! Well done sir.
@rosshadden I've updated the readme on my branch to include this final gif as per your suggestion (thanks, by the way) - please could you take a look at it and tell me what you think... Maybe you have other ideas how to improve it? Having it right at the top would make it stand out, but I think it is more natural the way it is at the moment... https://github.com/keith-hall/sublime-xpath/tree/query_completions#autocomplete_demo
We could try to make a gif for each of the main features, which I consider to be:
I am not sure if it is worth showing the functionality to copy the xpath(s) of the nodes under the cursor, what do you think? have I missed anything important? maybe you know of some awesome screen -> gif making tool that we could use? :) my current workflow is less than desirable ;)
Edit: using the advice given here, I am going to attach the images to this post, so that we can link to them from the readme.
I think it's well-placed in the readme. Also I do think the 'right click -> copy xpath at cursor' workflow is worthy of at least a static screenshot. It is, after all, the original and probably still main point of this plugin :rabbit:
I don't know of a good workflow for making animated gifs, but I am sure there are good ones. I'll look around.
There seem to be a lot of solutions. One is this extension, which captures animated gifs but has poor reviews. There are more results including a tool I used to use for screenshots, Snagit. But another route is to just use a screen record tool, as videos can be converted to animated gif.
Thanks @rosshadden for the feedback and research :) I will get to work creating the animated gifs :) I think I will make a gif for the 'right click -> copy xpath at cursor' workflow, rather than just a static screenshot, to show what is copied to the clipboard by pasting it into a new file :)
All demos added to readme. I had the idea that for people upgrading to the new version, we could show a message with a changelog to explain the new functionality. For new installs, I feel it wouldn't be relevant, as the user would probably have already read the readme before deciding to install it. What do you think? Maybe we would just want the message to be similar to this: https://github.com/keith-hall/sublime-xpath/releases/tag/v0.2.0-beta6
I agree. Or even just link to those release notes. You should make a pull request with this stuff btw.
Implemented by #12.
This... is... SO HELPFUL!!!!!! I'm at a loss for words. I was not expecting this feature, and then this morning, BAM, there's the message from Package Control that says, "Hey, you can query by XPath now". No more taking the XML into Firefox and plodding through document.evaluate() results. YESSSSS!!!
This is going to save me so much time. THANK YOU thank you thank you. I skimmed the back-and-forth above, and clearly this wasn't an easy feature to add. You have one VERY pleasantly surprised end-user this morning.
Thanks for the feedback @dsc-allenk :) glad you like it!
you're right that it wasn't easy to implement, but that is partially my fault for not wanting to make do with just "the bare minimum" ;)
Great idea with the Package Control status, @keith-hall :-D. I didn't even know that was a thing.
Hello Keith and Ross,
First of all, congrats to Ross for developing sublime-xpath and to Keith for maintaining/enhancing it. I sent an email to Ross, and he said Keith was maintaining/enhancing it. As I see, the ability to see results as you type was added by Keith.
It seems sublime-xpath is a pretty nice tool for xpath.
Here are some questions:
1) Do you have a YouTube video on how to use it? Animations are good, but not easy to follow - you cannot pause them to see exactly what is happening, and there is no voice explaining it, so they are hard to follow when you are trying to figure out how to use the plugin. I think it would be really great to have a YouTube video speaking and demoing how to use it. If you don't have one already, maybe I should create one after I get a better handle on it and put a link to it in the README - what do you think?
2) How do I start using it? For example, how do I get the "enter xpath" label with the line to enter xpath expressions? I guess I type Ctrl+Shift+P and then type "XPath: Query document", is that so? Ross said YES to this, so I'm leaving this question here for anyone else having trouble later on.
3) If my user settings file is empty, does that mean the defaults are being used? Ross said YES to this, so I'm leaving this question and answer for future reference for other users.
4) It looks like it only works with Sublime Text 3 and does not work with Sublime Text 2, is that so? BTW, in Sublime Text 3, xpath was not fully working (dynamic results not popping up, etc.). It all worked after I upgraded Sublime Text 3 to the latest version as of today, Jan 29 2017.
5) I believe sublime-xpath currently supports ONLY XPath 1.0, is that so? What about 2.0 and 3.0? Any plans? What would be involved in supporting them?
Hey, I'm using your XPath tool in Sublime and it's working out great, so let me say thank you for saving me time and pain. After reviewing your code, I thought it would also be cool/convenient to have an XPath query function that could use lxml's xpath engine for validating/locating XPaths. I understand this might have some difficulty with malformed xml/xhtml, but I thought it would be useful in most cases. What do you think? If you're interested but busy, it's something I wouldn't mind doing a pull request for, adding the functionality in the next month or two.