Parse Given TextReader

nickbabcock commented 10 years ago

Currently the parser only 'works' if given a brand-new stream, around which it constructs its own StreamReader.

Here is an example of what can go wrong in a client program.

using (var fs = new FileStream( /*...*/ ))
using (var sr = new StreamReader(fs, /* ... */))
{
    sr.ReadLine();
    // fs has already been advanced past the first line by sr's buffering.
    ParadoxParser.Parse(fs, /* ... */);
}

Since StreamReader is buffered, it consumes much more of the underlying stream than the single line it returns. Therefore, if fs is then passed to Parse, the parser starts reading well past the end of the first line, which is not desired. The client should have the option of passing in a buffered text reader. The one drawback is that we would have no control over the text reader's encoding, which may be a problem.
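To make the buffering effect concrete, here is a minimal sketch that uses a MemoryStream in place of a real save file. The class and method names (BufferingDemo, PositionAfterFirstLine, SampleSave) are made up for illustration, and the 1024-byte buffer size is an arbitrary choice, not something the parser uses.

```csharp
using System;
using System.IO;
using System.Text;

public static class BufferingDemo
{
    // Returns how far the underlying stream has advanced after the
    // StreamReader has handed back only the first line.
    public static long PositionAfterFirstLine(Stream stream)
    {
        // leaveOpen: true so disposing the reader does not close the stream.
        using (var reader = new StreamReader(stream, Encoding.UTF8, false, 1024, true))
        {
            reader.ReadLine();
            return stream.Position;
        }
    }

    // Hypothetical sample data: a short first line followed by filler.
    public static MemoryStream SampleSave()
    {
        var text = "EU4txt\n" + new string('a', 10000);
        return new MemoryStream(Encoding.UTF8.GetBytes(text));
    }
}
```

The first line plus its terminator is only 7 bytes, yet BufferingDemo.PositionAfterFirstLine(BufferingDemo.SampleSave()) reports a position well beyond that, because the reader fills its whole buffer before returning the line.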

Measter commented 10 years ago

You could save the position of the stream, reset it to 0, parse the file, then restore the position. You would not be able to dispose or close your StreamReader, though.

nickbabcock commented 10 years ago

Good point. I did a quick run of this earlier and it didn't work, but you inspired me to take another look and I think I may have found the trick.

using (var fs = new FileStream( /*...*/ ))
using (var sr = new StreamReader(fs, /* ... */))
{
    string line = sr.ReadLine();

    // Paradox files are encoded with the Windows code page (1252), so count
    // bytes with that encoding instead of relying on line.Length, which is a
    // character count and matches the byte count only for single-byte encodings.
    // Note: the count excludes the line terminator that ReadLine strips.
    int count = Encoding.GetEncoding(1252).GetByteCount(line);

    fs.Seek(count, SeekOrigin.Begin);
    ParadoxParser.Parse(fs, /* ... */);
}

Problems

What if the client doesn't create their TextReader with the Windows code page? They could just use the length of the line they read as the number of bytes, but they run the risk of misreading text.
Putting this logic in the parser would make things simpler for the client, but it could add some code bloat.
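As a rough illustration of the first problem, the sketch below counts the bytes of the same string under two encodings. ByteCountDemo is a made-up helper, and ISO-8859-1 is used only as a single-byte stand-in for code page 1252, on the assumption that 1252 may need an extra encoding provider registered on newer .NET runtimes.

```csharp
using System.Text;

public static class ByteCountDemo
{
    // Byte count when the text is encoded as UTF-8 (variable width).
    public static int Utf8Bytes(string line)
    {
        return Encoding.UTF8.GetByteCount(line);
    }

    // Byte count under a single-byte encoding (stand-in for 1252).
    public static int Latin1Bytes(string line)
    {
        return Encoding.GetEncoding("iso-8859-1").GetByteCount(line);
    }
}
```

For a line such as "Orléans" (7 characters), Latin1Bytes returns 7 while Utf8Bytes returns 8, so an offset computed with the wrong encoding would land one byte off.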

Definitely something to think about.

EDIT: just realized you were talking about something slightly different and making two passes at the file.

Measter commented 10 years ago

You can get the position directly from the stream object. Like so:

using (var fs = new FileStream( /*...*/ ))
using (var sr = new StreamReader(fs, /*...*/))
{
    string line = sr.ReadLine();

    // Store current position.
    long pos = fs.Position;

    // Move to beginning of stream.
    fs.Position = 0;
    ParadoxParser.Parse(fs, /*...*/);

    // Reset position.
    fs.Position = pos;
}

You need to make sure the stream supports it by checking the CanSeek property.
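The save, rewind, parse, restore sequence with the CanSeek check could be wrapped up roughly like this. StreamRewind and ParseFromStart are hypothetical names, and the parse callback is only a stand-in for a call such as ParadoxParser.Parse(fs, ...), whose real signature is not shown in this thread.

```csharp
using System;
using System.IO;

public static class StreamRewind
{
    // Runs `parse` from the start of the stream, then restores the
    // position the stream had before the call.
    public static void ParseFromStart(Stream stream, Action<Stream> parse)
    {
        if (!stream.CanSeek)
            throw new NotSupportedException("Stream must be seekable to rewind.");

        long saved = stream.Position;
        try
        {
            stream.Position = 0;
            parse(stream);
        }
        finally
        {
            // Restore even if parsing throws.
            stream.Position = saved;
        }
    }
}
```

Note that any StreamReader already wrapped around the stream still holds its old buffer, so this only helps when the two reads are independent of each other.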

nickbabcock commented 10 years ago

I feel like you might be really close to something, let me clarify with an example:

EU4 savegames now have a first line of EU4txt followed by the usual save structure. The problem is that EU4txt doesn't correspond to any defined structure in the parser. It is, essentially, a special first line. Thus I want to read the first line (maybe do some checking on it) and then start the parser on the next line. The catch is that fs.Position reports the end of the block that the StreamReader has buffered, not the end of the line that was just read.

For instance on the previous example the following code:

Console.WriteLine(fs.Position);
Console.WriteLine(sr.ReadLine());
Console.WriteLine(fs.Position);

will print:

0
EU4txt
4096

Obviously, we don't want to start the parser on byte 4096 (the default buffer size for a StreamReader), but rather on byte 7 or 8.

Any ideas?

Measter commented 10 years ago

Well, this code is rather hacky, but it does work:

using (var fs = new FileStream( /*...*/ ))
using (var sr = new StreamReader(fs, /*...*/))
{
    string line = sr.ReadLine();

    // Store current position. Note: CurrentEncoding only works after reading.
    long pos = sr.CurrentEncoding.GetByteCount(line);

    // Move to end of read line.
    fs.Position = pos;

    // Read bytes until it's not a new line character.
    int nextChar;
    do
    {
        nextChar = fs.ReadByte();
    } while (nextChar == '\r' || nextChar == '\n');

    // Move back 1 character.
    fs.Position--;

    ParadoxParser.Parse(fs, /*...*/);
}

If the parser can handle starting with an empty line, then you don't need to do the do-while loop or subtract from fs.Position.
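A different approach, sketched here as an alternative rather than what either of us posted above, is to skip the StreamReader for the header and read the first line byte by byte straight from the stream, so nothing is buffered past it. FirstLine and ReadHeader are hypothetical names. The sketch assumes an encoding in which the newline byte never appears inside another character (true for single-byte code pages and UTF-8, not for UTF-16).

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class FirstLine
{
    // Reads bytes up to and including the first line terminator (\n or \r\n)
    // and returns the decoded line without the terminator. The stream is
    // left positioned at the first byte of the next line.
    public static string ReadHeader(Stream stream, Encoding encoding)
    {
        var bytes = new List<byte>();
        int b;
        while ((b = stream.ReadByte()) != -1 && b != '\n')
        {
            bytes.Add((byte)b);
        }

        // Drop a trailing \r so Windows line endings are handled too.
        if (bytes.Count > 0 && bytes[bytes.Count - 1] == '\r')
        {
            bytes.RemoveAt(bytes.Count - 1);
        }

        return encoding.GetString(bytes.ToArray());
    }
}
```

After calling ReadHeader the same stream can be handed to the parser directly, with no seeking and no byte-count arithmetic, and it also works on streams that do not support seeking.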

nickbabcock commented 10 years ago

In general the parser is really robust, so it will handle empty lines. In fact, the parser detects that the first line is EU4txt, but the applications of what we are discussing are more far-reaching.

The code example you showed boils down to what I showed in https://github.com/nickbabcock/Pdoxcl2Sharp/issues/17#issuecomment-32061151; the only difference is fs.Position = pos vs fs.Seek(pos, SeekOrigin.Begin).

But it looks like the solution thus far is to push this issue out to whoever is using the parser and not try to support reading from a TextReader in the parser. Is that your thinking as well?

Measter commented 10 years ago

Given the issue with encoding, you are correct. I would say that supporting only streams would be better.

nickbabcock / Pdoxcl2Sharp

Parse Given TextReader #17