ikemeneghetti closed this issue 2 years ago
Hi @ikemeneghetti, thank you for opening this issue, I will investigate it. Here is the documentation about the XML reader: https://streamthoughts.github.io/kafka-connect-file-pulse/docs/developer-guide/file-readers/#xxxxmlfileinputreader
You can configure an XPath expression using the property reader.xpath.expression. Also, XML validation is not enabled by default.
Hi! Thanks for the quick response!
I've tried some xpath expressions, but the issue is that I can't get "myCDR" as the root of my record.
I will paste an excerpt of the records that I can read from the topic.
When I remove the DOCTYPE line manually and do not define an XPath expression:
{"myCDR":{"myCDR":{"cdrData":{"array":[{"io.confluent.connect.avro.CdrData":{"headerModule":{"io.confluent.connect.avro.HeaderModule":{"recordId":{
When I leave the original XML untouched and use an XPath expression:
{"cdrData":{"array":[{"CdrData":{"basicModule":null,"centrexModule":null,"headerModule":{"io.confluent.connect.avro.HeaderModule":{"recordId":
I tried the following expressions:
I validated these expressions with xmllint and they seem to include "myCDR" in the output.
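To make the distinction concrete, here is a minimal sketch of what "including myCDR in the output" means for a path expression. It uses Python's standard xml.etree.ElementTree purely for illustration and does not reproduce the connector's behaviour. The sample document is hypothetical: the element names come from this thread, while the nesting, the DTD file name and the value are assumed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element names are taken from the thread,
# the nesting, the DTD name and the value are assumptions.
SAMPLE = """<!DOCTYPE myCDR SYSTEM "myCDR.dtd">
<myCDR>
  <cdrData>
    <headerModule>
      <recordId>
        <info1>a</info1>
      </recordId>
    </headerModule>
  </cdrData>
</myCDR>"""


def selected_tags(xml_text, path):
    """Return the tag names of the elements matched by `path`,
    evaluated relative to the document's root element."""
    root = ET.fromstring(xml_text)
    return [element.tag for element in root.findall(path)]


if __name__ == "__main__":
    # "." selects the root element itself, so the root name is kept.
    print(selected_tags(SAMPLE, "."))
    # "./cdrData" selects the children, so the root name is lost.
    print(selected_tags(SAMPLE, "./cdrData"))
```

An expression that selects the root element keeps "myCDR" as the outermost name, whereas one that selects its children starts at "cdrData", which matches the shape of the second record excerpt above.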
Hi @ikemeneghetti, a fix was pushed for this issue. You can try it by using the following docker image:
docker pull streamthoughts/kafka-connect-file-pulse:master
Hi @fhussonnois!
I just got a chance to test the update.
Worked perfectly!
I noticed that some WARNs appear in the log, but it doesn't seem to be a problem since the info came up in the topic normally. I'll put an excerpt of the log just to demonstrate, but the problem can already be considered solved.
2021-12-16 17:39:18,593 INFO [task-thread-connect-file-pulse-quickstart-xml-0] Started FilePulse source task (io.streamthoughts.kafka.connect.filepulse.source.FilePulseSourceTask)
2021-12-16 17:39:18,593 INFO [task-thread-connect-file-pulse-quickstart-xml-0] WorkerSourceTask{id=connect-file-pulse-quickstart-xml-0} Source task finished initialization and start (org.apache.kafka.connect.runtime.WorkerSourceTask)
2021-12-16 17:39:18,598 INFO [task-thread-connect-file-pulse-quickstart-xml-0] WorkerSourceTask{id=connect-file-pulse-quickstart-xml-0} Executing source task (org.apache.kafka.connect.runtime.WorkerSourceTask)
2021-12-16 17:39:18,601 INFO [kafka-producer-network-thread | connector-producer-connect-file-pulse-quickstart-xml-0] [Producer clientId=connector-producer-connect-file-pulse-quickstart-xml-0] Cluster ID: GD8uTB8GQECRME_IRvRnIw (org.apache.kafka.clients.Metadata)
2021-12-16 17:39:18,617 INFO [task-thread-connect-file-pulse-quickstart-xml-0] Opening new iterator for: file:/tmp/kafka-connect/examples/file_1.xml (io.streamthoughts.kafka.connect.filepulse.source.DelegateFileInputIterator)
2021-12-16 17:39:18,620 WARN [task-thread-connect-file-pulse-quickstart-xml-0] Handled XML parser error on file file:/tmp/kafka-connect/examples/file_1.xml. Error: Element type "myCDR" must be declared. (io.streamthoughts.kafka.connect.filepulse.fs.reader.xml.XMLFileInputIterator)
2021-12-16 17:39:18,620 WARN [task-thread-connect-file-pulse-quickstart-xml-0] Handled XML parser error on file file:/tmp/kafka-connect/examples/file_1.xml. Error: Element type "cdrData" must be declared. (io.streamthoughts.kafka.connect.filepulse.fs.reader.xml.XMLFileInputIterator)
2021-12-16 17:39:18,620 WARN [task-thread-connect-file-pulse-quickstart-xml-0] Handled XML parser error on file file:/tmp/kafka-connect/examples/file_1.xml. Error: Element type "headerModule" must be declared. (io.streamthoughts.kafka.connect.filepulse.fs.reader.xml.XMLFileInputIterator)
2021-12-16 17:39:18,620 WARN [task-thread-connect-file-pulse-quickstart-xml-0] Handled XML parser error on file file:/tmp/kafka-connect/examples/file_1.xml. Error: Element type "recordId" must be declared. (io.streamthoughts.kafka.connect.filepulse.fs.reader.xml.XMLFileInputIterator)
2021-12-16 17:39:18,620 WARN [task-thread-connect-file-pulse-quickstart-xml-0] Handled XML parser error on file file:/tmp/kafka-connect/examples/file_1.xml. Error: Element type "info1" must be declared. (io.streamthoughts.kafka.connect.filepulse.fs.reader.xml.XMLFileInputIterator)
2021-12-16 17:39:18,620 WARN [task-thread-connect-file-pulse-quickstart-xml-0] Handled XML parser error on file file:/tmp/kafka-connect/examples/file_1.xml. Error: Element type "info2" must be declared. (io.streamthoughts.kafka.connect.filepulse.fs.reader.xml.XMLFileInputIterator)
2021-12-16 17:39:18,620 WARN [task-thread-connect-file-pulse-quickstart-xml-0] Handled XML parser error on file file:/tmp/kafka-connect/examples/file_1.xml. Error: Element type "info3" must be declared. (io.streamthoughts.kafka.connect.filepulse.fs.reader.xml.XMLFileInputIterator)
2021-12-16 17:39:18,620 WARN [task-thread-connect-file-pulse-quickstart-xml-0] Handled XML parser error on file file:/tmp/kafka-connect/examples/file_1.xml. Error: Element type "info4" must be declared. (io.streamthoughts.kafka.connect.filepulse.fs.reader.xml.XMLFileInputIterator)
2021-12-16 17:39:18,620 WARN [task-thread-connect-file-pulse-quickstart-xml-0] Handled XML parser error on file file:/tmp/kafka-connect/examples/file_1.xml. Error: Element type "type" must be declared. (io.streamthoughts.kafka.connect.filepulse.fs.reader.xml.XMLFileInputIterator)
2021-12-16 17:39:18,662 INFO [task-thread-connect-file-pulse-quickstart-xml-0] Closed iterator for: [uri=file:/tmp/kafka-connect/examples/file_1.xml, name='file_1.xml', contentLength=302, lastModified=1639676337000, contentDigest=[digest=2787635386, algorithm='CRC32'], userDefinedMetadata={system.inode=67563658, system.hostname=786e7e695127}] (io.streamthoughts.kafka.connect.filepulse.source.DelegateFileInputIterator)
2021-12-16 17:39:18,662 INFO [task-thread-connect-file-pulse-quickstart-xml-0] Completed all object files. FilePulse source task is transitioning to IDLE state while waiting for new reconfiguration request from source connector. (io.streamthoughts.kafka.connect.filepulse.source.FilePulseSourceTask)
2021-12-16 17:39:22,713 INFO [FileSystemMonitorThread] Cleaning up completed object files '1' (io.streamthoughts.kafka.connect.filepulse.fs.DefaultFileSystemMonitor)
2021-12-16 17:39:22,716 INFO [FileSystemMonitorThread] Finished cleaning all completed object files (io.streamthoughts.kafka.connect.filepulse.fs.DefaultFileSystemMonitor)
2021-12-16 17:39:22,716 INFO [FileSystemMonitorThread] Scheduled files still being processed: 1. Skip filesystem listing while waiting for tasks completion (io.streamthoughts.kafka.connect.filepulse.fs.DefaultFileSystemMonitor)
2021-12-16 17:39:22,716 INFO [FileSystemMonitorThread] Completed filesystem monitoring iteration in 3 ms (io.streamthoughts.kafka.connect.filepulse.source.FileSystemMonitorThread)
2021-12-16 17:39:27,821 INFO [FileSystemMonitorThread] Starting to list object files using: LocalFSDirectoryListing (io.streamthoughts.kafka.connect.filepulse.fs.DefaultFileSystemMonitor)
2021-12-16 17:39:27,821 INFO [FileSystemMonitorThread] Completed object files listing. '0' object files found in 0ms (io.streamthoughts.kafka.connect.filepulse.fs.DefaultFileSystemMonitor)
2021-12-16 17:39:27,822 INFO [FileSystemMonitorThread] Finished lookup for new object files: '0' files can be scheduled for processing (io.streamthoughts.kafka.connect.filepulse.fs.DefaultFileSystemMonitor)
Thanks!
Hi! First of all, thanks for maintaining this awesome plugin.
I'm getting "Unsupported node type 10" error when I try to read xml records in the following format:
Error output:
As expected, when I remove the DOCTYPE line, the record is processed normally. I would like to be able to process these files without having to remove the DOCTYPE. I don't need to do any XML validation, so I can just ignore it. I couldn't find a way to ignore this declaration or extract what I need via an XPath expression.
Here are some of the settings I am using for this connector:
Thanks in advance!
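For readers wondering where the number in "Unsupported node type 10" comes from: in the W3C DOM, node type 10 is DOCUMENT_TYPE_NODE, the node a DOM parser creates for a DOCTYPE declaration. The sketch below illustrates this with Python's standard xml.dom.minidom, chosen only for illustration since the connector itself is Java. The sample document and the DTD file name are hypothetical, and the skip-the-doctype approach is an assumption about one possible workaround, not a description of the fix that was actually pushed.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# Hypothetical sample: element names come from the thread,
# the DTD file name and the value are assumptions.
SAMPLE = """<!DOCTYPE myCDR SYSTEM "myCDR.dtd">
<myCDR><cdrData><headerModule><recordId>42</recordId></headerModule></cdrData></myCDR>"""


def top_level_node_types(xml_text):
    """Return the DOM node type of every direct child of the document."""
    document = parseString(xml_text)
    return [child.nodeType for child in document.childNodes]


def top_level_elements(xml_text):
    """Return the tag names of the document's direct children,
    keeping element nodes only (the DOCTYPE node is skipped)."""
    document = parseString(xml_text)
    return [
        child.tagName
        for child in document.childNodes
        if child.nodeType == Node.ELEMENT_NODE
    ]


if __name__ == "__main__":
    # The DOCTYPE shows up as a child of the document, next to the root element.
    print(top_level_node_types(SAMPLE))
    # Filtering on ELEMENT_NODE leaves only the root element.
    print(top_level_elements(SAMPLE))
```

A converter that walks every child of the document and only knows about elements, text and similar nodes would hit the DOCTYPE node first, which is consistent with the error disappearing once the DOCTYPE line is removed.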