Closed andrei-ionescu closed 2 years ago
@jorgecarleitao: I don't think that all the cases are covered in the current arrow2
implementation.
I would reopen the previous #3892 ticket but I cannot.
cc: @ritchie46
@ritchie46, @jorgecarleitao: Any ETA on having this fix pulled from arrow2 into here?
It already is.
Let me pull the master and try the test again.
@jorgecarleitao, I ran some tests and found another case that produces an OutOfSpec error. Here is the error:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value:
OutOfSpec("The children DataTypes of a StructArray must equal the children data
types.\n However, the values at index 1 have a length of 114072,
which is different from values at index 0, 630.")',
/.../.cargo/git/checkouts/arrow2-945af624853845da/da64106/src/array/struct_/mod.rs:118:52
stack backtrace:
0: rust_begin_unwind
at /rustc/7b46aa594c4bdc507fbd904b6777ca30c37a9209/library/std/src/panicking.rs:584:5
1: core::panicking::panic_fmt
at /rustc/7b46aa594c4bdc507fbd904b6777ca30c37a9209/library/core/src/panicking.rs:142:14
2: core::result::unwrap_failed
at /rustc/7b46aa594c4bdc507fbd904b6777ca30c37a9209/library/core/src/result.rs:1805:5
3: arrow2::array::struct_::StructArray::new
4: polars_core::chunked_array::logical::struct_::StructChunked::update_chunks
5: polars_core::series::implementations::struct_::<impl polars_core::series::series_trait::SeriesTrait
for polars_core::series::implementations::SeriesWrap<polars_core::
chunked_array::logical::struct_::StructChunked>>::append
6: polars_core::series::implementations::struct_::<impl polars_core::series::series_trait::SeriesTrait
for polars_core::series::implementations::SeriesWrap<polars_core::
chunked_array::logical::struct_::StructChunked>>::append
7: polars_core::series::implementations::struct_::<impl polars_core::series::series_trait::SeriesTrait
for polars_core::series::implementations::SeriesWrap<polars_core::
chunked_array::logical::struct_::StructChunked>>::append
...
Maybe this can help in some way until I'm able to create a slimmer parquet file. The current file that produces this error is about 110MB.
@ritchie46, @jorgecarleitao: I managed to print out the conflicting data structures. This is how they are looking...
Values at index 0:
LargeUtf8Array[3490050010715265545, 2061035645983490919, 8001251476546823717, ...]
Values at index 1:
StructArray[{code: 3245164418740504690}, {code: 3245164418740504690}, ...]
The first line (the one with index 0) contains 630 strings, each formed of 19 digits.
The second line contains code: 3245164418740504690 repeated 114072 times.
The fields are:
[
Field {
name: "id", data_type: LargeUtf8, is_nullable: true, metadata: {}
},
Field {
name: "namespace", data_type: Struct(
[
Field {
name: "code", data_type: LargeUtf8, is_nullable: true, metadata: {}
}
]
), is_nullable: true, metadata: {}
},
Field {
name: "primary", data_type: Boolean, is_nullable: true, metadata: {}
}
]
I don't think the data is the culprit because there is no issue in Spark. I think there is an issue with that code value being repeated 114072 times; it should not look like that.
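For context, arrow2 rejects a StructArray whose children differ in length. A toy re-implementation of that check (illustrative only, not the real arrow2 code; the function name is made up) reproduces the shape of the error above:

```python
# Toy model of the invariant arrow2's StructArray enforces: every child
# array must hold the same number of values. Illustrative only, not the
# real arrow2 implementation.
def validate_struct_children(children):
    first_len = len(children[0])
    for i, child in enumerate(children[1:], start=1):
        if len(child) != first_len:
            raise ValueError(
                f"The children must have an equal number of values. "
                f"However, the values at index {i} have a length of "
                f"{len(child)}, which is different from values at index 0, "
                f"{first_len}."
            )

# Mimic the mismatch reported above: 630 ids vs 114072 namespace structs.
try:
    validate_struct_children([["id"] * 630, [{"code": "x"}] * 114072])
except ValueError as e:
    print(e)
```

Any mechanism that produces children of unequal length, regardless of where it sits in the read path, will trip this same check.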
Hey @andrei-ionescu. Thanks again for the patience and for the report; it is very useful. Sorry for the late reply, I am on vacation with limited access to the internet.
Just to make sure I understood the last comment: "index 0" and "index 1" represent the column index, "line" represents the row number, and the issue is that the columns have a different number of rows.
Are you able to create a (mock) file with e.g. pandas or pyarrow that reproduces the problem?
@jorgecarleitao: Here is the file: part-00003-a422a23f-e65a-4cab-9bd0-6e877a8f7337-c000.snappy.parquet.zip (about 72MB zipped, 117MB parquet). I could not make it any slimmer.
@jorgecarleitao, @ritchie46: Is this cherry picked in polars?
@jorgecarleitao, @ritchie46: I just tried the latest arrow2 + latest polars (both straight from the git repos) + the file above, and I still see the same OutOfSpec error.
Am I missing something?
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value:
OutOfSpec("The children DataTypes of a StructArray must equal the children data
types.\n However, the values at index 1 have a length of 114072,
which is different from values at index 0, 630.")',
/.../arrow2/src/array/struct_/mod.rs:118:52
stack backtrace:
0: rust_begin_unwind
at /rustc/7b46aa594c4bdc507fbd904b6777ca30c37a9209/library/std/src/panicking.rs:584:5
1: core::panicking::panic_fmt
at /rustc/7b46aa594c4bdc507fbd904b6777ca30c37a9209/library/core/src/panicking.rs:142:14
2: core::result::unwrap_failed
at /rustc/7b46aa594c4bdc507fbd904b6777ca30c37a9209/library/core/src/result.rs:1805:5
3: core::result::Result<T,E>::unwrap
at /rustc/7b46aa594c4bdc507fbd904b6777ca30c37a9209/library/core/src/result.rs:1098:23
4: arrow2::array::struct_::StructArray::new
at /.../arrow2/src/array/struct_/mod.rs:118:9
5: polars_core::chunked_array::logical::struct_::StructChunked::update_chunks
at /.../polars/polars/polars-core/src/chunked_array/logical/struct_/mod.rs:76:32
6: polars_core::series::implementations::struct_::<impl polars_core::series::series_trait::SeriesTrait for polars_core::series::implementations::SeriesWrap<polars_core::chunked_array::logical::struct_::StructChunked>>::append
at /.../polars/polars/polars-core/src/series/implementations/struct_.rs:128:9
7: polars_core::series::Series::append
at /.../polars/polars/polars-core/src/series/mod.rs:210:9
8: polars_core::series::implementations::struct_::<impl polars_core::series::series_trait::SeriesTrait for polars_core::series::implementations::SeriesWrap<polars_core::chunked_array::logical::struct_::StructChunked>>::append
at /.../polars/polars/polars-core/src/series/implementations/struct_.rs:126:13
9: polars_core::series::Series::append
at /.../polars/polars/polars-core/src/series/mod.rs:210:9
10: polars_core::series::implementations::struct_::<impl polars_core::series::series_trait::SeriesTrait for polars_core::series::implementations::SeriesWrap<polars_core::chunked_array::logical::struct_::StructChunked>>::append
at /.../polars/polars/polars-core/src/series/implementations/struct_.rs:126:13
11: polars_core::series::Series::append
at /.../polars/polars/polars-core/src/series/mod.rs:210:9
12: polars_core::frame::DataFrame::vstack_mut::{{closure}}
at /.../polars/polars/polars-core/src/frame/mod.rs:908:17
13: core::iter::traits::iterator::Iterator::try_for_each::call::{{closure}}
at /rustc/7b46aa594c4bdc507fbd904b6777ca30c37a9209/library/core/src/iter/traits/iterator.rs:2296:26
14: core::iter::traits::iterator::Iterator::try_fold
at /rustc/7b46aa594c4bdc507fbd904b6777ca30c37a9209/library/core/src/iter/traits/iterator.rs:2238:21
15: core::iter::traits::iterator::Iterator::try_for_each
at /rustc/7b46aa594c4bdc507fbd904b6777ca30c37a9209/library/core/src/iter/traits/iterator.rs:2299:9
16: polars_core::frame::DataFrame::vstack_mut
at /.../polars/polars/polars-core/src/frame/mod.rs:903:9
17: polars_core::utils::accumulate_dataframes_vertical
at /.../polars/polars/polars-core/src/utils/mod.rs:813:9
18: polars_io::parquet::read_impl::read_parquet
at /.../polars/polars/polars-io/src/parquet/read_impl.rs:289:22
19: polars_io::parquet::read::ParquetReader<R>::_finish_with_scan_ops
at /.../polars/polars/polars-io/src/parquet/read.rs:61:9
20: polars_lazy::physical_plan::executors::scan::parquet::ParquetExec::read
at /.../polars/polars/polars-lazy/src/physical_plan/executors/scan/parquet.rs:39:9
21: <polars_lazy::physical_plan::executors::scan::parquet::ParquetExec as polars_lazy::physical_plan::Executor>::execute::{{closure}}
at /.../polars/polars/polars-lazy/src/physical_plan/executors/scan/parquet.rs:62:68
22: polars_lazy::physical_plan::file_cache::FileCache::read
at /.../polars/polars/polars-lazy/src/physical_plan/file_cache.rs:40:13
23: <polars_lazy::physical_plan::executors::scan::parquet::ParquetExec as polars_lazy::physical_plan::Executor>::execute
at /.../polars/polars/polars-lazy/src/physical_plan/executors/scan/parquet.rs:60:9
24: <polars_lazy::physical_plan::executors::udf::UdfExec as polars_lazy::physical_plan::Executor>::execute
at /.../polars/polars/polars-lazy/src/physical_plan/executors/udf.rs:12:18
25: polars_lazy::frame::LazyFrame::collect
at /.../polars/polars/polars-lazy/src/frame/mod.rs:720:19
26: gyrfalcon::main
at ./src/main.rs:21:14
27: core::ops::function::FnOnce::call_once
at /rustc/7b46aa594c4bdc507fbd904b6777ca30c37a9209/library/core/src/ops/function.rs:248:5
No, that was on me. The fix was insufficient - I believe https://github.com/jorgecarleitao/arrow2/pull/1188 fixes this. Your file is a really good fuzz test.
@jorgecarleitao: I'm glad that it's helpful.
@jorgecarleitao, I just tested the code changes you merged with https://github.com/jorgecarleitao/arrow2/pull/1188 and I can still see the issue. I can also confirm that the error message is now the new one you changed in the PR: The children must have an equal number of values.
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value:
OutOfSpec("The children must have an equal number of values.\n
However, the values at index 1 have a length of 114072, which is different
from values at index 0, 630.")',
/.../arrow2/src/array/struct_/mod.rs:118:52
stack backtrace:
0: rust_begin_unwind
at /rustc/6dbae3ad19309bb541d9e76638e6aa4b5449f29a/library/std/src/panicking.rs:584:5
1: core::panicking::panic_fmt
at /rustc/6dbae3ad19309bb541d9e76638e6aa4b5449f29a/library/core/src/panicking.rs:142:14
2: core::result::unwrap_failed
at /rustc/6dbae3ad19309bb541d9e76638e6aa4b5449f29a/library/core/src/result.rs:1814:5
3: core::result::Result<T,E>::unwrap
at /rustc/6dbae3ad19309bb541d9e76638e6aa4b5449f29a/library/core/src/result.rs:1107:23
4: arrow2::array::struct_::StructArray::new
at /.../arrow2/src/array/struct_/mod.rs:118:9
5: polars_core::chunked_array::logical::struct_::StructChunked::update_chunks
at /.../polars/polars/polars-core/src/chunked_array/logical/struct_/mod.rs:76:32
6: polars_core::series::implementations::struct_::<impl polars_core::series::series_trait::SeriesTrait for polars_core::series::implementations::SeriesWrap<polars_core::chunked_array::logical::struct_::StructChunked>>::append
at /.../polars/polars/polars-core/src/series/implementations/struct_.rs:128:9
7: polars_core::series::Series::append
at /.../polars/polars/polars-core/src/series/mod.rs:210:9
8: polars_core::series::implementations::struct_::<impl polars_core::series::series_trait::SeriesTrait for polars_core::series::implementations::SeriesWrap<polars_core::chunked_array::logical::struct_::StructChunked>>::append
at /.../polars/polars/polars-core/src/series/implementations/struct_.rs:126:13
9: polars_core::series::Series::append
at /.../polars/polars/polars-core/src/series/mod.rs:210:9
10: polars_core::series::implementations::struct_::<impl polars_core::series::series_trait::SeriesTrait for polars_core::series::implementations::SeriesWrap<polars_core::chunked_array::logical::struct_::StructChunked>>::append
at /.../polars/polars/polars-core/src/series/implementations/struct_.rs:126:13
11: polars_core::series::Series::append
at /.../polars/polars/polars-core/src/series/mod.rs:210:9
12: polars_core::frame::DataFrame::vstack_mut::{{closure}}
at /.../polars/polars/polars-core/src/frame/mod.rs:908:17
13: core::iter::traits::iterator::Iterator::try_for_each::call::{{closure}}
at /rustc/6dbae3ad19309bb541d9e76638e6aa4b5449f29a/library/core/src/iter/traits/iterator.rs:2296:26
14: core::iter::traits::iterator::Iterator::try_fold
at /rustc/6dbae3ad19309bb541d9e76638e6aa4b5449f29a/library/core/src/iter/traits/iterator.rs:2238:21
15: core::iter::traits::iterator::Iterator::try_for_each
at /rustc/6dbae3ad19309bb541d9e76638e6aa4b5449f29a/library/core/src/iter/traits/iterator.rs:2299:9
16: polars_core::frame::DataFrame::vstack_mut
at /.../polars/polars/polars-core/src/frame/mod.rs:903:9
17: polars_core::utils::accumulate_dataframes_vertical
at /.../polars/polars/polars-core/src/utils/mod.rs:813:9
18: polars_io::parquet::read_impl::read_parquet
at /.../polars/polars/polars-io/src/parquet/read_impl.rs:289:22
19: polars_io::parquet::read::ParquetReader<R>::_finish_with_scan_ops
at /.../polars/polars/polars-io/src/parquet/read.rs:61:9
20: polars_lazy::physical_plan::executors::scan::parquet::ParquetExec::read
at /.../polars/polars/polars-lazy/src/physical_plan/executors/scan/parquet.rs:39:9
21: <polars_lazy::physical_plan::executors::scan::parquet::ParquetExec as polars_lazy::physical_plan::Executor>::execute::{{closure}}
at /.../polars/polars/polars-lazy/src/physical_plan/executors/scan/parquet.rs:62:68
22: polars_lazy::physical_plan::file_cache::FileCache::read
at /.../polars/polars/polars-lazy/src/physical_plan/file_cache.rs:40:13
23: <polars_lazy::physical_plan::executors::scan::parquet::ParquetExec as polars_lazy::physical_plan::Executor>::execute
at /.../polars/polars/polars-lazy/src/physical_plan/executors/scan/parquet.rs:60:9
24: <polars_lazy::physical_plan::executors::udf::UdfExec as polars_lazy::physical_plan::Executor>::execute
at /.../polars/polars/polars-lazy/src/physical_plan/executors/udf.rs:12:18
25: polars_lazy::frame::LazyFrame::collect
at /.../polars/polars/polars-lazy/src/frame/mod.rs:720:19
26: gyrfalcon::main
at ./src/main.rs:21:14
27: core::ops::function::FnOnce::call_once
at /rustc/6dbae3ad19309bb541d9e76638e6aa4b5449f29a/library/core/src/ops/function.rs:248:5
Strange - I can read the file you posted here with
# in arrow2
cargo run --release --example parquet_read --features io_parquet,io_parquet_compression,io_print -- part-00003-a422a23f-e65a-4cab-9bd0-6e877a8f7337-c000.snappy.parquet
Changing limit and chunk_size of the reader does not impact this. I also tried using the parallel reader.
The error still comes from arrow2. Can it be the way Polars uses the arrow2 API?
> The error still comes from arrow2. Can it be the way Polars uses the arrow2 API?
I don't think the fix was already in the polars branch.
I'm building the example from git with updated dependencies in Polars to reference the latest arrow2.
It may be something in the update_chunks in polars-core/src/chunked_array/logical/struct_/mod.rs (mod.rs#L76-L80). It does a
StructArray::new(
    ArrowDataType::Struct(new_fields.clone()),
    field_arrays,
    None,
)
Maybe there is something wrong with the params received from polars.
Is there an update on this? I'm curious whether something else is required here, as this is an important use case.
> Is there an update on this? I'm curious whether something else is required here, as this is an important use case.
If you can run it in arrow, I expect this is something on our side. I will look into this
I can also read the file on latest master:
>>> pl.read_parquet("nested_struct_OutOfSpec.snappy.parquet")
shape: (2, 1)
┌─────────────────────────────────────┐
│ dim                                 │
│ ---                                 │
│ struct[4]                           │
╞═════════════════════════════════════╡
│ {{null,null,null,null,null,null,... │
├─────────────────────────────────────┤
│ {{null,null,null,"2gYhOc2Edy8GBw... │
└─────────────────────────────────────┘
Thanks for the fix upstream @jorgecarleitao.
@andrei-ionescu we are close to a crates.io release. You can already point to latest master to have your fix working, but it will also work on crates.io soon. :)
I will close this now.
@ritchie46
1) Did you try it with the other file: part-00003-a422a23f-e65a-4cab-9bd0-6e877a8f7337-c000.snappy.parquet.zip? The first one works but not this one.
2) How can I re-open this ticket as it is not resolved?
The issue occurs when appending structs of different chunk sizes.
MWE:
s = pl.Series([{'_experience': {'aaid': {'id': '7759804769753743647',
'namespace': {'code': '3245164418740504690'},
'primary': True},
'mcid': {'id': None, 'namespace': {'code': None}, 'primary': None}}},
{'_experience': {'aaid': {'id': '8337071409830986729',
'namespace': {'code': '3245164418740504690'},
'primary': False},
'mcid': {'id': '6495617396286731444',
'namespace': {'code': '3624253825458969727'},
'primary': True}}},
{'_experience': {'aaid': {'id': '5948492535810675291',
'namespace': {'code': '3245164418740504690'},
'primary': True},
'mcid': {'id': None, 'namespace': {'code': None}, 'primary': None}}}])
s.append(s[:2])
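A toy model of how appending chunks of different sizes can leave a struct's children with unequal lengths. This is a hypothetical sketch, not the actual polars internals: it just assumes a struct column stored as per-field chunk lists, with one field's chunk list falling out of sync during a rebuild:

```python
# Hypothetical sketch: a struct column stored as per-field chunk lists.
# If rebuilding concatenates each field's chunks independently and one
# field is missing a chunk, the rebuilt children disagree in length,
# which is exactly what arrow2's StructArray constructor rejects.
id_chunks = [["a", "b", "c"], ["d", "e"]]    # chunks of length 3 and 2
ns_chunks = [[{"code": 1}] * 3]              # second chunk lost (the bug)

ids = [v for chunk in id_chunks for v in chunk]          # 5 values
namespaces = [v for chunk in ns_chunks for v in chunk]   # 3 values

print(len(ids), len(namespaces))  # prints: 5 3 -> unequal children
```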
…the map datatype, which is not supported by polars.
@ritchie46, thanks for looking into this. What about the map datatype? Is there a plan to add map support in Polars?
Polars will not add the map dtype. Its benefits do not outweigh the extra complexity. Maybe we can investigate conversion of maps to struct. But I will have to explore that.
With #4226 we can read the entire file. The map dtype will be converted to its physical type, which is supported by polars.
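For reference, Arrow's Map logical type is physically a list of {key, value} entry structs, so a map column read back as its physical type looks roughly like this (plain-Python sketch of the layout, not the polars API; the function name is made up):

```python
# Sketch of Arrow's physical layout for a Map value: a list of
# {key, value} structs. A reader that drops the logical Map dtype can
# still expose the data in this physical form.
def map_to_physical(m):
    """Represent a mapping as a list of {key, value} entry structs."""
    return [{"key": k, "value": v} for k, v in m.items()]

print(map_to_physical({"a": 1, "b": 2}))
# prints: [{'key': 'a', 'value': 1}, {'key': 'b', 'value': 2}]
```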
@ritchie46, @jorgecarleitao: We need to re-open this one more time.
With the code given below and the previous file (part-00003-a422a23f-e65a-4cab-9bd0-6e877a8f7337-c000.snappy.parquet.zip) I get the OutOfSpec error again.
let df = LazyFrame::scan_parquet(
file_location,
ScanArgsParquet::default())
.unwrap()
.filter(
col("timestamp").cast(DataType::Datetime(TimeUnit::Nanoseconds, None))
.gt(datetime(DatetimeArgs {
year: lit(2022),
month: lit(1),
day: lit(1),
hour: None,
minute: None,
second: None,
millisecond: None
}))
)
.select([
count().alias("monthcount"),
col("timestamp"),
])
.collect()
.unwrap();
dbg!(df);
When I remove the filter, it does not panic.
Here is the panic error:
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value:
OutOfSpec("The children must have an equal number of values.\n
However, the values at index 1 have a length of 1, which is different
from values at index 0, 2.")',
/.../.cargo/git/checkouts/arrow2-8a2ad61d97265680/8604cb7/src/array/struct_/mod.rs:118:52
(two threads panicked with the same message; their output was interleaved)
stack backtrace:
0: rust_begin_unwind
at /rustc/f9cba63746d0fff816250b2ba7b706b5d4dcf000/library/std/src/panicking.rs:584:5
1: core::panicking::panic_fmt
at /rustc/f9cba63746d0fff816250b2ba7b706b5d4dcf000/library/core/src/panicking.rs:142:14
2: core::result::unwrap_failed
at /rustc/f9cba63746d0fff816250b2ba7b706b5d4dcf000/library/core/src/result.rs:1814:5
3: <arrow2::io::parquet::read::statistics::struct_::DynMutableStructArray as arrow2::array::MutableArray>::as_box
4: <arrow2::io::parquet::read::statistics::list::DynMutableListArray as arrow2::array::MutableArray>::as_box
5: <arrow2::io::parquet::read::statistics::Statistics as core::convert::From<arrow2::io::parquet::read::statistics::MutableStatistics>>::from
6: arrow2::io::parquet::read::statistics::deserialize
7: polars_io::parquet::predicates::read_this_row_group
8: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once
9: <alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<T,I>>::spec_extend
10: <rayon::iter::map::MapFolder<C,F> as rayon::iter::plumbing::Folder<T>>::consume_iter
11: rayon::iter::plumbing::bridge_producer_consumer::helper
12: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
13: rayon_core::registry::WorkerThread::wait_until_cold
14: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
stack backtrace:
0: rust_begin_unwind
at /rustc/f9cba63746d0fff816250b2ba7b706b5d4dcf000/library/std/src/panicking.rs:584:5
1: core::panicking::panic_fmt
at /rustc/f9cba63746d0fff816250b2ba7b706b5d4dcf000/library/core/src/panicking.rs:142:14
2: core::result::unwrap_failed
at /rustc/f9cba63746d0fff816250b2ba7b706b5d4dcf000/library/core/src/result.rs:1814:5
3: <arrow2::io::parquet::read::statistics::struct_::DynMutableStructArray as arrow2::array::MutableArray>::as_box
4: <arrow2::io::parquet::read::statistics::list::DynMutableListArray as arrow2::array::MutableArray>::as_box
5: <arrow2::io::parquet::read::statistics::Statistics as core::convert::From<arrow2::io::parquet::read::statistics::MutableStatistics>>::from
6: arrow2::io::parquet::read::statistics::deserialize
7: polars_io::parquet::predicates::read_this_row_group
8: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once
9: <alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<T,I>>::spec_extend
10: <rayon::iter::map::MapFolder<C,F> as rayon::iter::plumbing::Folder<T>>::consume_iter
11: rayon::iter::plumbing::bridge_producer_consumer::helper
12: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
13: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
14: rayon_core::registry::WorkerThread::wait_until_cold
15: rayon_core::registry::ThreadBuilder::run
Could you have another look?
@ritchie46, @jorgecarleitao Any updates on this?
@andrei-ionescu found another issue, opened it upstream https://github.com/jorgecarleitao/arrow2/issues/1239.
@ritchie46, @jorgecarleitao Thanks for looking into it! I've seen the upstream ticket and fix PR are complete. Is it ready in this PR? Can I run another set of tests?
Yes, give it a spin. :)
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: OutOfSpec("The children must have an equal number of values.\n However, the values at index 21 have a length of 11896, which is different from values at index 0, 11901.")', /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow2-0.17.3/src/array/struct_/mod.rs:122:52
Folks, I am facing a similar error on the latest version. Any pointers as to how I can fix this?
Hi @rajatkb-sc - I also encountered the same issue in 0.19. I was able to narrow it down to an empty struct inside a nested list in a json file. I wrote a script to loop through the json and delete empty nodes before loading to a dataframe, and it resolved the issue.
What language are you using?
Rust
Which feature gates did you use?
"polars-io", "parquet", "lazy", "dtype-struct"
Have you tried latest version of polars?
What version of polars are you using?
Latest, master branch.
What operating system are you using polars on?
macOS Monterey 12.3.1
What language version are you using?
Describe your bug.
Reading nested struct panics with an OutOfSpec error.
What are the steps to reproduce the behavior?
Given the attached parquet file with only 2 rows: nested_struct_OutOfSpec.snappy.parquet.zip
Running the following code:
Results in this panic error:
What is the actual behavior?
The result is a panic error with this output:
What is the expected behavior?
The parquet file should have been correctly loaded.
The parquet-tools util shows it properly. Also, Apache Spark properly reads and processes it.