Parsing problem - Githubissues

conker84 commented 5 years ago

Hi, we have this issue in APOC: https://github.com/neo4j-contrib/neo4j-apoc-procedures/issues/1286 (see in particular this comment: https://github.com/neo4j-contrib/neo4j-apoc-procedures/issues/1286#issuecomment-530538305).

So I tried to reproduce the problem and I exported the dataset into this file with the apoc.export.cypher.all procedure.

But you can use this file, which contains only the "bad" line that I extracted from the file above.

And if I execute this:

cat import/twitter.cypher | ./bin/cypher-shell --non-interactive

I get this error:

Invalid input ':': expected <init> (line 2, column 1 (offset: 1)) <line omitted>

So I created these two tests looking for an invalid line:


    @Test
    public void testParsingTwitterFileWithScanner() throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        mapper.configure(JsonParser.Feature.ALLOW_UNQUOTED_FIELD_NAMES, true);
        String prefix = ":param rows => ";
        int lineNo = 0;
        try (Scanner scan = new Scanner(this.getClass().getClassLoader().getResourceAsStream("twitter.cypher"), "UTF-8")) {
            while (scan.hasNext()) {
                ++lineNo;
                String line = scan.nextLine();
                if (!line.startsWith(prefix)) {
                    continue;
                }
                line = line.substring(prefix.length());
                try {
                    mapper.readValue(line, Object.class);
                } catch (Exception e) {
                    System.err.println("Scanner: " + lineNo);
                    System.err.println("Scanner: " + line);
                }
            }
        }
    }


    @Test
    public void testParsingTwitterFileWithFileUtils() throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        mapper.configure(JsonParser.Feature.ALLOW_UNQUOTED_FIELD_NAMES, true);
        String prefix = ":param rows => ";
        List<String> lines = IOUtils.readLines(this.getClass().getClassLoader().getResourceAsStream("twitter.cypher"), Charset.forName("UTF-8"));
        lines.forEach(line -> {
            if (!line.startsWith(prefix)) {
                return;
            }
            line = line.substring(prefix.length());
            try {
                mapper.readValue(line, Object.class);
            } catch (Exception e) {
                System.err.println("IOUtils: " + line);
            }
        });
    }

While testParsingTwitterFileWithScanner reports a parse failure on 7 lines, testParsingTwitterFileWithFileUtils (which actually reads the file with IOUtils.readLines) parses every line without errors.

So I looked at the nextLine method of the Scanner class and found that it uses the pattern "\r\n|[\n\r\u2028\u2029\u0085]" to detect the end of a line. If I open the file in Sublime and look at the first failing line, right after the last words reported there is a 0x85 character, which should be the \u0085 in the Scanner line pattern. So Scanner cuts the line at that character, and the truncated line no longer parses.
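The difference between the two readers can be reproduced without the Twitter file. The following is a minimal sketch, not the code used by cypher-shell or APOC: the class name LineSplitDemo, its two helper methods, and the sample string are made up for illustration. It relies on two documented behaviours: Scanner.nextLine() treats U+0085 as a line separator, while BufferedReader.readLine() (which IOUtils.readLines uses internally) splits only on \n, \r and \r\n.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class: compares how Scanner and BufferedReader split
// a string that contains U+0085 (NEXT LINE) inside a single logical line.
public class LineSplitDemo {

    // Splits with Scanner.nextLine(), which also breaks on U+0085, U+2028 and U+2029.
    static List<String> scannerLines(String text) {
        List<String> lines = new ArrayList<>();
        try (Scanner scan = new Scanner(text)) {
            while (scan.hasNextLine()) {
                lines.add(scan.nextLine());
            }
        }
        return lines;
    }

    // Splits with BufferedReader.readLine(), which breaks only on LF, CR and CRLF.
    static List<String> readerLines(String text) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample: one ":param" line with U+0085 embedded in a string value.
        String text = ":param rows => [{text:\"before\u0085after\"}]\n";

        System.out.println("Scanner lines:        " + scannerLines(text).size());
        System.out.println("BufferedReader lines: " + readerLines(text).size());
    }
}
```

With this input Scanner yields two fragments, neither of which is a complete value, while BufferedReader yields the single original line.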

I don't know whether you use the Scanner class under the hood, but I hope I have provided enough information to show where the problem is.

eastlondoner commented 5 years ago

@conker84 are you sure it's not that the cypher export function should be escaping the U+0085 character in some way? It seems to be a valid 'new line' character: https://www.compart.com/en/unicode/U+0085

conker84 commented 5 years ago

I thought about that, and it's something we can do, but only as a workaround: I think the correct behaviour is the one shown by testParsingTwitterFileWithFileUtils, so there is a way to parse the string correctly.
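For reference, the export-side workaround discussed here could look roughly like the sketch below. It is only an illustration under stated assumptions: the class LineBreakEscaper and the method escapeUnicodeLineBreaks are hypothetical names, not part of APOC, and the sketch assumes that whatever reads the exported file accepts backslash-u escapes inside string literals. It replaces the three Unicode characters that Scanner treats as line separators (U+0085, U+2028, U+2029) with their escaped text form, so the value stays on one physical line for any reader.

```java
// Hypothetical helper, not APOC's implementation: escapes the Unicode
// line-break characters that java.util.Scanner treats as line separators.
public class LineBreakEscaper {

    static String escapeUnicodeLineBreaks(String value) {
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '\u0085' || c == '\u2028' || c == '\u2029') {
                // Emit a backslash, the letter u and four hex digits as plain text.
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "before\u0085after";
        String escaped = escapeUnicodeLineBreaks(raw);
        // The escaped form is longer because one character became six.
        System.out.println(raw.length() + " -> " + escaped.length());
        System.out.println(escaped);
    }
}
```

This only hides the symptom on the export side; a reader that splits on U+0085 would still break on files produced by other tools.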

neo4j / cypher-shell

Parsing problem #169