simonkimi / xpath_selector

An XPath selector for locating Html and Xml elements
BSD 3-Clause "New" or "Revised" License

How to improve speed of field extraction? #3

Open bubnenkoff opened 2 years ago

bubnenkoff commented 2 years ago

I need to extract a value from XML: the first value found for any of the names in a list. The following code works, but it is very slow:

import 'dart:io';

import 'package:xpath_selector/xpath_selector.dart';

void main(List<String> arguments) async {
  String file_name =
      r'D:\code\flutter\test_xpath\fksNotificationEA44_0373100047421000653_27036656.xml';

  final fileContent = File(file_name).readAsStringSync();

  List<String> purchaseNumber_xPathRulesList = [
    "purchaseNumber",
    "notificationNumber",
  ];

  var res = extractValue(fileContent, purchaseNumber_xPathRulesList);
  print(res);
}

String? extractValue(String fileContent, List<String> variantList) {
  String? result;
  for (var el in variantList) {
    final elements = XPath.xml(fileContent).query('//*[local-name()="$el"]');
    if (elements.nodes.isNotEmpty) {
      result = elements.nodes.first.text;
      break;
    }
  }

  return result;
}

How can I improve speed?

fksNotificationEA44_0373100047421000653_27036656.zip

simonkimi commented 2 years ago

When the descendant selector `//` is placed first, the program scans the whole document, which takes a lot of time. Please narrow the scanning range as much as possible.

I'm thinking about how to optimize performance. Until then, please use native queries for complex cases.
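Separately from the `//` scan, the snippet in the question calls `XPath.xml(fileContent)` inside the loop, so the file is parsed again for every candidate name. A minimal sketch that parses once and reuses the result, using only the calls already shown in the question (`XPath.xml`, `query`, `nodes`). It does not remove the full-document scan itself, so treat it as a partial improvement at best:

```dart
import 'package:xpath_selector/xpath_selector.dart';

/// Returns the text of the first element whose local name matches one of
/// [variantList], trying the names in the given order.
String? extractValue(String fileContent, List<String> variantList) {
  // Parse the document once instead of once per candidate name.
  final xpath = XPath.xml(fileContent);

  for (final name in variantList) {
    // Same query as in the question; it still walks the whole tree.
    final result = xpath.query('//*[local-name()="$name"]');
    if (result.nodes.isNotEmpty) {
      return result.nodes.first.text;
    }
  }
  return null;
}
```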

bubnenkoff commented 2 years ago

@simonkimi how can I minimize the scan range? All I know about my data is that it sits in the first 20 lines of the XML. The parent elements can have different names.

The example above takes far too long to extract the data...

simonkimi commented 2 years ago

So please use native queries for now. I will optimize this in the near future.

bubnenkoff commented 2 years ago

What are "native queries"? Something like:

XPath.xml(html).query('//*[local-name()="$el"]');

?

simonkimi commented 2 years ago

This package uses the xml package for parsing. You can use that library directly.
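As a rough illustration of using the xml package directly, here is a sketch that assumes `package:xml` has been added as a direct dependency. The helper name `extractValue` simply mirrors the question, and the namespace handling (comparing only the local name) is an assumption about what the `local-name()` query was meant to do:

```dart
import 'package:xml/xml.dart';

/// Returns the text of the first element whose local name matches one of
/// [variantList], trying the names in the given order.
String? extractValue(String fileContent, List<String> variantList) {
  // Parse once; the resulting tree is reused for every candidate name.
  final document = XmlDocument.parse(fileContent);

  for (final name in variantList) {
    // `descendants` is a lazy iterable, so the walk stops at the first match.
    for (final element in document.descendants.whereType<XmlElement>()) {
      // Compare the local name only, ignoring any namespace prefix.
      if (element.name.local == name) {
        return element.innerText;
      }
    }
  }
  return null;
}
```

If the wanted elements are near the top of the file, the inner loop returns after visiting only a few nodes, although the whole file is still parsed up front. The xml package also ships an event-based API (`package:xml/xml_events.dart`) that avoids building the tree, which may be worth trying for large files.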

bubnenkoff commented 2 years ago

@simonkimi any updates?

simonkimi commented 2 years ago

Sorry, there are no updates. I think this is due to a structural flaw in the library, which makes it difficult to speed up without refactoring. But because of work, I don't have time to maintain it.

simonkimi commented 2 years ago

In fact, the speed bothered me too, so I switched to another approach: take a Go XPath library, compile it into a .so or .dll file, and call it through FFI. It works fine for me.

bubnenkoff commented 2 years ago

Could you provide code examples?

On Tue, 16 Aug 2022 at 12:35, GugeFramework @.***> wrote:

In fact, I was also troubled by speed, so I used another method: use golang's xpath library, compile it into so or dll files, and call it with ffi, it works fine for me.


simonkimi commented 2 years ago

https://github.com/simonkimi/catweb-parser I use protobuf to pass the HTML and the selectors (CSS or XPath) to Go, and then pass the results back. However, that code is tailored to another project, so it is not general-purpose.