mudge / riveted

A Clojure library for the fast processing of XML with VTD-XML.
https://clojars.org/riveted
27 stars 2 forks source link

Support API that allows using byte arrays or different buffer types instead of string? #6

Closed ieugen closed 2 years ago

ieugen commented 2 years ago

We are using this library and I noticed uses the String arg and converts to byte array. I believe it should also provide a byte array + charset API.

One use case is for example we use LMDB and going from bytes > string -> bytes is not efficient.

defn navigator
  ([^String xml] (navigator xml false))
  ([^String xml namespace-aware]
mudge commented 2 years ago

Do you have a proposal for how to add this to the existing navigator function without breaking compatibility? We'd need to switch on the class of xml but I'm not sure what the most idiomatic way to do that is: multimethods? Protocols?

mudge commented 2 years ago

How about something like this?

(defmulti navigator
  (fn
    ([xml] (class xml))
    ([xml namespace-aware] (class xml))))

(defmethod navigator java.lang.String
  ([^String xml] (navigator xml false))
  ([^String xml namespace-aware]
   (let [vg (doto (VTDGen.) (.setDoc (.getBytes xml "UTF-8"))
                  (.parse namespace-aware))]
     (Navigator. (.getNav vg)))))

(defmethod navigator (Class/forName "[B")
  ([^"[B" xml] (navigator xml false))
  ([^"[B" xml namespace-aware]
   (let [vg (doto (VTDGen.) (.setDoc xml)
                  (.parse namespace-aware))]
     (Navigator. (.getNav vg)))))
mudge commented 2 years ago

And here's an implementation using protocols instead:

(defprotocol Navigable
  (navigator [xml] [xml namespace-aware]))

(extend-protocol Navigable
  java.lang.String
  (navigator
    ([^String xml] (navigator xml false))
    ([^String xml namespace-aware]
     (let [vg (doto (VTDGen.) (.setDoc (.getBytes xml "UTF-8"))
                    (.parse namespace-aware))]
       (Navigator. (.getNav vg))))))

(extend-protocol Navigable
  (Class/forName "[B")
  (navigator
    ([^"[B" xml] (navigator xml false))
    ([^"[B" xml namespace-aware]
     (let [vg (doto (VTDGen.) (.setDoc xml)
                    (.parse namespace-aware))]
       (Navigator. (.getNav vg))))))

Are there types other than java.lang.String and [B (Java's native byte array) we'd need to implement this for?

ieugen commented 2 years ago

Hi,

Thanks for the positive feedback. I think String and byte array are good enough.

IMO String variant should allow to set other charsets than UTF-8. Or we could consider people should use the byte version if they have strings with other encodings.

I thought that InputStream might also be a candidate - but saw that VTDGen can't use streams :( .

Thanks,

mudge commented 2 years ago

It might be best to encourage people to pass a byte array to navigator and leave the UTF-8 String option as a convenience. That way, people can pass strings with other encodings by passing (.getBytes my-string "ISO-8859-1") and we keep this a separate concern?

I'm leaning towards the protocol implementation as I can see a future where you could, say, pass a file handler, etc.

ieugen commented 2 years ago

Yes, I think protocol implementation is good. I would add the recommendations in as function descriptions / help.

Thanks for providing deps.edn, I think it's going to be very useful.

mudge commented 2 years ago

I've now raised draft PR #9 with this change. Would you please be able to test it with SHA da2f0bb345adb65908aed35cbcef5a393c453b8b and see if that works as expected?

mudge commented 2 years ago

I've just released version 0.2.0 of riveted with this new API:

; Generate a navigator for XML in my-byte-array without namespace support
(navigator my-byte-array)

; Generate a navigator for XML in my-byte-array with namespace support
(navigator my-byte-array true)