javascriptdata / danfojs

Danfo.js is an open source, JavaScript library providing high performance, intuitive, and easy to use data structures for manipulating and processing structured data.
https://danfo.jsdata.org/
MIT License

Sorting DataFrame on encoded String corrupts data #31

Closed danielruss closed 4 years ago

danielruss commented 4 years ago

Hello, If you have a dataframe that contains strings, then you cannot sort it. So I tried to use the label encoder and create a new column filled with the encoded labels. If you sort on the encoded labels, the dataframe is corrupted.

Here is a browser example:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <script src="https://cdn.jsdelivr.net/npm/danfojs@0.1.1/dist/index.min.js"></script>
    <title>Document</title>
  </head>
  <body>
    <script>
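      // Two string columns; X1 contains repeated values.
      const df = new dfd.DataFrame({
        X1: ["c", "a", "b", "c", "c", "a", "b"],
        X2: ["^", "%", "!", "#", "$", "&", "*"],
      });
      // Fit the encoder on the sorted unique labels so that a -> 0, b -> 1, c -> 2.
      let encoder = new dfd.LabelEncoder();
      let x2 = df.X1.unique().values.sort();
      console.log(x2);
      encoder.fit(x2);
      // Add the encoded labels as a new column, then sort on that column.
      df.addColumn({ column: "X1_encoded", value: encoder.transform(df.X1) });
      const df2 = df.sort_values({ by: "X1_encoded" });

      df.print(); // original frame
      df2.print(); // sorted frame (corrupted)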
    </script>
  </body>
</html>

The output is:

i  X1  X2  X1_encoded
0  c   ^   2
1  a   %   0
2  b   !   1
3  c   #   2
4  c   $   2
5  a   &   0
6  b   *   1

i  X1  X2  X1_encoded
1  a   %   0
1  a   %   0
2  b   !   1
2  b   !   1
0  c   ^   2
0  c   ^   2

Notice that the rows are all the same for each level of X1_encoded. The original rows 3-6 are lost.
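
Until the sort is fixed, one possible workaround is to reorder the raw column arrays in plain JavaScript before building the DataFrame. The sketch below makes no danfojs calls at all; `sortColumnsBy` is a hypothetical helper written for this example, not part of the library.

```javascript
// Same data as in the example above, kept as plain arrays.
const columns = {
  X1: ["c", "a", "b", "c", "c", "a", "b"],
  X2: ["^", "%", "!", "#", "$", "&", "*"],
};

// Hypothetical helper: reorder every column by the string values of one column.
function sortColumnsBy(cols, key) {
  const keyValues = cols[key];
  // Sort row positions, not values, so each row keeps its own cells.
  const order = keyValues
    .map((_, i) => i)
    .sort((i, j) =>
      keyValues[i] < keyValues[j] ? -1 : keyValues[i] > keyValues[j] ? 1 : 0
    );
  const out = {};
  for (const name of Object.keys(cols)) {
    out[name] = order.map((i) => cols[name][i]);
  }
  return out;
}

const sortedColumns = sortColumnsBy(columns, "X1");
console.log(sortedColumns.X1); // ["a", "a", "b", "b", "c", "c", "c"]
console.log(sortedColumns.X2); // ["%", "&", "!", "*", "^", "#", "$"]
```

The sorted object can then be passed to `new dfd.DataFrame(...)` in the same way as the original data.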

steveoni commented 4 years ago

It has been fixed. Update your version to 0.1.2. Thanks.

danielruss commented 4 years ago

I updated the link for the CDN to 0.1.2, and I got the same results.

<script src="https://cdn.jsdelivr.net/npm/danfojs@0.1.2/dist/index.min.js"></script>

steveoni commented 4 years ago

That's true. I was able to reproduce the error. We will fix it.

risenW commented 4 years ago

I was also able to reproduce this error. It occurs when sorting a column with non-unique entries: the index gets duplicated. This is a minor fix; @steveoni is currently working on it.
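
To illustrate how duplicated index entries can arise, here is a minimal sketch in plain JavaScript. It assumes, without having checked the danfojs source, that row positions were recovered by looking up each sorted value in the original column. Both function names are hypothetical and exist only for this example.

```javascript
// Encoded column from the report above.
const encoded = [2, 0, 1, 2, 2, 0, 1];

// Assumed buggy approach: sort the values, then look each one up again.
// indexOf always returns the FIRST match, so tied values collapse onto one row.
function argsortByLookup(values) {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted.map((v) => values.indexOf(v));
}

// Safer approach: sort the row positions themselves, comparing by value.
// Array.prototype.sort is stable in ES2019+, so ties keep their original order.
function argsort(values) {
  return values.map((_, i) => i).sort((i, j) => values[i] - values[j]);
}

const buggyOrder = argsortByLookup(encoded);
const fixedOrder = argsort(encoded);
console.log(buggyOrder); // [1, 1, 2, 2, 0, 0, 0]
console.log(fixedOrder); // [1, 5, 2, 6, 0, 3, 4]
```

The lookup version repeats rows 1, 2 and 0, the same pattern as in the corrupted output, while the position-based sort visits every row exactly once.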

steveoni commented 4 years ago

FIXED. See commit here