mperdeck / LINQtoCSV

Popular, easy to use library to read and write CSV files.

Rewinding stream in CsvContext.ReadData has to take into account preamble (BOM) #13

Open adolfogg opened 10 years ago

adolfogg commented 10 years ago

Hi Matt, I think there might be a bug in how the stream is rewound in the ReadData method of CsvContext. The current code does not take into account the length of a possible BOM in a UTF-8 or Unicode encoded stream, so the 2, 3 or 4 bytes of the BOM are read as part of the first element.

If the stream has column names in the first line, this raises an exception because the first column name (with the BOM prepended) is not found in T.

I think you could fix it by replacing line 126 in CsvContext.cs:

stream.BaseStream.Seek(0, SeekOrigin.Begin);

with this code:

// Skip Unicode preamble/BOM (Byte Order Mark) if present.
stream.Peek(); // Forces encoding detection so CurrentEncoding is up to date.
var preamble = stream.CurrentEncoding.GetPreamble();
var bytes = new byte[preamble.Length];
stream.BaseStream.Seek(0, SeekOrigin.Begin);
stream.BaseStream.Read(bytes, 0, bytes.Length);
if (!bytes.SequenceEqual(preamble))
{
    // No BOM at the start of the stream: rewind to the very beginning.
    stream.BaseStream.Seek(0, SeekOrigin.Begin);
}
stream.DiscardBufferedData();

NOTES:

  1. stream.Peek() forces the reader to inspect the stream, which sets stream.CurrentEncoding to the detected encoding when detectEncodingFromByteOrderMarks is set to true in the StreamReader constructor.
  2. stream.DiscardBufferedData() resets the reader's internal buffer so that the stream can be safely read again.
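
For reference, the proposed fix can be sketched as a self-contained helper. This is only an illustration under stated assumptions: the names BomRewind and RewindPastPreamble are hypothetical and not part of LINQtoCSV, and the read loop is an addition that guards against Stream.Read returning fewer bytes than requested.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class BomRewind
{
    // Hypothetical helper (not part of LINQtoCSV): rewinds the reader to the
    // first byte after the BOM, or to offset 0 when no BOM is present.
    // Returns the resulting position of the underlying stream.
    public static long RewindPastPreamble(StreamReader reader)
    {
        // Force encoding detection so CurrentEncoding reflects the BOM, if any.
        reader.Peek();
        byte[] preamble = reader.CurrentEncoding.GetPreamble();

        Stream baseStream = reader.BaseStream;
        baseStream.Seek(0, SeekOrigin.Begin);

        // Read may return fewer bytes than requested, so loop until done or EOF.
        var head = new byte[preamble.Length];
        int total = 0;
        while (total < head.Length)
        {
            int n = baseStream.Read(head, total, head.Length - total);
            if (n == 0) break;
            total += n;
        }

        // No (complete) BOM at the start: go back to the very beginning.
        if (total != preamble.Length || !head.SequenceEqual(preamble))
        {
            baseStream.Seek(0, SeekOrigin.Begin);
        }

        // Drop any buffered characters so the next read starts at the new position.
        reader.DiscardBufferedData();
        return baseStream.Position;
    }
}
```

Calling BomRewind.RewindPastPreamble(reader) before each new pass over the data would leave the underlying stream just after the BOM when one is present, so the next ReadLine() should not pick up the BOM as part of the first field.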

Best regards.

mperdeck commented 10 years ago

Hi adolfogg,

Thanks for the bug report. Do you have code that reproduces the problem?

That would allow me to 1) confirm the issue; and 2) create a unit test to avoid regression.

Thanks,

Matt

adolfogg commented 10 years ago

Hi Matt, I think I have narrowed down the problem. It happens when you enumerate the IEnumerable<T> returned by CsvContext.Read more than once when using a stream (it doesn't happen when using a file name).

Here is the code to reproduce the problem:

@using LINQtoCSV;
@using System.Text;
@{
    var stringBuilder = new StringBuilder();
    stringBuilder.AppendLine("Column1;Column2");
    stringBuilder.AppendLine("Elementó1-1;Elementó1-2");
    stringBuilder.AppendLine("Elementó2-1;Elementó2-2");

    // Create the underlying MemoryStream.
    using (var memoryStream = new MemoryStream())
    {
        var encoding = new UTF8Encoding(true);

        // Write preamble/BOM.
        var preambleBytes = encoding.GetPreamble();
        memoryStream.Write(preambleBytes, 0, preambleBytes.Length);

        // Write StringBuilder.
        var stringBuilderBytes = encoding.GetBytes(stringBuilder.ToString());
        memoryStream.Write(stringBuilderBytes, 0, stringBuilderBytes.Length);

        // Create the StreamReader with Unicode encoding detection from the BOM, falling back to the default ANSI encoding if no BOM is detected.
        using (var streamReader = new StreamReader(memoryStream, Encoding.Default, true))
        {
            var csvElements = new CsvContext().Read<CsvElement>(streamReader, new CsvFileDescription { EnforceCsvColumnAttribute = true, FirstLineHasColumnNames = true, MaximumNbrExceptions = -1, SeparatorChar = ';' });
            csvElements.ToList();
            csvElements.ToList(); // This second enumeration raises a NameNotInTypeException.
        }
    }
}
@functions
{
    protected class CsvElement
    {
        [CsvColumn(FieldIndex = 0)]
        public string Column1 { get; set; }

        [CsvColumn(FieldIndex = 1)]
        public string Column2 { get; set; }
    }
}
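
The rewind behaviour can also be looked at in isolation, without LINQtoCSV. The sketch below is an assumption-laden reduction of the repro above: RewindDemo, BuildCsvWithBom and ReadFirstLineTwice are hypothetical names, and the rewind step (Seek to 0 followed by DiscardBufferedData) mirrors what the issue describes rather than the library's exact code. On runtimes where the problem reproduces, the second line returned is expected to begin with the BOM character U+FEFF, but that should be checked rather than taken as given.

```csharp
using System;
using System.IO;
using System.Text;

public static class RewindDemo
{
    // Builds an in-memory CSV that starts with a UTF-8 BOM, positioned at offset 0.
    public static MemoryStream BuildCsvWithBom()
    {
        var encoding = new UTF8Encoding(true);
        var stream = new MemoryStream();

        byte[] preamble = encoding.GetPreamble();
        stream.Write(preamble, 0, preamble.Length);

        byte[] body = encoding.GetBytes("Column1;Column2\r\nA;B\r\n");
        stream.Write(body, 0, body.Length);

        stream.Position = 0;
        return stream;
    }

    // Reads the first line, rewinds to offset 0 without skipping the BOM,
    // then reads the first line again. Returns both results for comparison.
    public static string[] ReadFirstLineTwice(StreamReader reader)
    {
        string first = reader.ReadLine();

        reader.BaseStream.Seek(0, SeekOrigin.Begin);
        reader.DiscardBufferedData();
        string second = reader.ReadLine();

        return new[] { first, second };
    }

    // True when the text begins with the BOM character U+FEFF.
    public static bool StartsWithBom(string text)
    {
        return text.Length > 0 && text[0] == '\uFEFF';
    }
}
```

Comparing RewindDemo.StartsWithBom(lines[0]) with RewindDemo.StartsWithBom(lines[1]) for the array returned by ReadFirstLineTwice shows whether the second pass picked up the BOM as data. StartsWithBom compares the first character directly because culture-sensitive string comparisons can ignore zero-width characters such as U+FEFF.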

Please note that although the code to skip the preamble/BOM that I posted yesterday seems to work when enumerating more than once, I am not sure it is the correct solution because, if you only enumerate once, your original code without skipping the preamble/BOM also works. Maybe you have to look somewhere else.

Hope this helps. Best Regards.