Reading challenge file #4

Open twocs opened 7 years ago

twocs commented 7 years ago

The challenge_dataset.txt data file was in a different format than the brain_body.txt data file used for demo.py.

To make this work more straightforward, I'd suggest either:

  1. Mention in the problem statement that reading this different data format is part of the task, and link to a resource that explains how to read it with pandas, or
  2. Provide both data files in the same format (i.e. both in fwf format)

agoila commented 7 years ago

To load this file into the same kind of dataframe as brain_body.txt, you can use this syntax:

dataframe = pd.read_csv('challenge_dataset.txt', sep=',', header=None, names=['X', 'Y'])

You can see the difference between read_fwf and read_csv here: https://cl.ly/0D1e2s2u2d0J
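A minimal sketch of that difference, using in-memory strings in place of the two files. The sample rows and the column names Brain and Body are made up for illustration; they are assumptions, not the actual contents of brain_body.txt or challenge_dataset.txt.

```python
import io
import pandas as pd

# Hypothetical fixed-width sample: columns are aligned with spaces
# and the first line holds the column headings.
fwf_text = (
    "Brain    Body\n"
    "3.385    44.5\n"
    "0.480    15.5\n"
)

# Hypothetical comma-separated sample with no heading line.
csv_text = "6.1101,17.592\n5.5277,9.1302\n"

# read_fwf infers the column boundaries from the whitespace padding.
fwf_frame = pd.read_fwf(io.StringIO(fwf_text))

# read_csv splits on the delimiter; header=None keeps the first row as data.
csv_frame = pd.read_csv(io.StringIO(csv_text), sep=',', header=None, names=['X', 'Y'])

print(fwf_frame)
print(csv_frame)
```

Either way you end up with a two-column dataframe, so the rest of demo.py only needs the column names adjusted.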

twocs commented 7 years ago

If I recall correctly, the headers differ as well. The demo data file (brain_body.txt) has column headings, but challenge_dataset.txt does not, so it needs header=None, as you've shown.

I think I spent most of the time figuring out how to read the csv file and access the data, with only a little effort for the linear regression.

twocs commented 7 years ago

Would a link to pandas read_csv in the Readme be helpful or is there a more accessible source of info? The scikit-learn tutorial that is linked in the Readme doesn't seem to use pandas.


noguess commented 7 years ago

Where should I push my code?

Just replace dataframe = pd.read_fwf('brain_body.txt') with dataframe = pd.read_csv('challenge_dataset.txt',names=('x','y'))


twocs commented 7 years ago
where will I push my codes?
Just replace
`dataframe = pd.read_fwf('brain_body.txt')`
with
`dataframe = pd.read_csv('challenge_dataset.txt',names=('x','y'))`

The point is not that it is possible for someone who knows how to use pandas to solve the I/O issues. The point is that there is no link to learn about pandas. If we are following the demo, we would try to use:

dataframe = pd.read_csv('challenge_dataset.txt')

But this is a problem, because the first line is inferred to be a header, not numerical data. That drops one data point and will affect the result of the linear regression. We must therefore use:

dataframe = pd.read_csv('challenge_dataset.txt', header=None)

and may access the data as follows:

x_values = dataframe[[0]]
y_values = dataframe[[1]]
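As a rough check of the header problem described above, here is a small sketch that reads the same three rows both ways. The rows are a tiny in-memory sample standing in for challenge_dataset.txt (an assumption made to keep the example self-contained).

```python
import io
import pandas as pd

# Three comma-separated rows with no heading line.
sample = "6.1101,17.592\n5.5277,9.1302\n8.5186,13.662\n"

# Default header='infer': the first data row is consumed as column names.
inferred = pd.read_csv(io.StringIO(sample))

# header=None: every row is kept as data and columns are numbered 0, 1.
explicit = pd.read_csv(io.StringIO(sample), header=None)

print(list(inferred.columns), len(inferred))
print(list(explicit.columns), len(explicit))

# Double brackets keep each selection as a one-column dataframe,
# which is the 2-D shape scikit-learn expects for fit().
x_values = explicit[[0]]
y_values = explicit[[1]]
```

With the default, the dataframe ends up one row short and its column names are the strings from the first data row.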

In solving the above, I tried consulting the pandas documentation, but it is hard to follow because there are so many optional parameters. I filed this issue because of the difficulty of working out the right call to pandas.read_csv. To fix it, I would make the same two suggestions I gave at the start of this issue report.

Here is the function signature for pandas.read_csv from the documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=False, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=False, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)

For what it's worth, this is the official guide for pandas I/O, which is slightly more useful: http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table

aomnes commented 7 years ago

You can add header=0

dataframe = pd.read_csv('challenge_dataset.txt', header=0, names=["x", "y"])

It works for me with that.

My challenge_dataset.txt:

x,y
6.1101,17.592
5.5277,9.1302
8.5186,13.662
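A caveat worth checking before copying this: header=0 together with names replaces an existing heading line, so it fits a file that starts with x,y as above. On a copy of the file without a heading line, it would discard the first data row instead. A small sketch with in-memory samples (the variable names are just for illustration):

```python
import io
import pandas as pd

# Same three rows, once with a heading line and once without.
with_heading = "x,y\n6.1101,17.592\n5.5277,9.1302\n8.5186,13.662\n"
without_heading = "6.1101,17.592\n5.5277,9.1302\n8.5186,13.662\n"

# header=0 + names: line 0 is treated as a heading and replaced by names.
replaced = pd.read_csv(io.StringIO(with_heading), header=0, names=["x", "y"])

# Same call on a file with no heading: the first data row is thrown away.
lost_row = pd.read_csv(io.StringIO(without_heading), header=0, names=["x", "y"])

# header=None + names: all rows are kept when there is no heading line.
kept_all = pd.read_csv(io.StringIO(without_heading), header=None, names=["x", "y"])

print(len(replaced), len(lost_row), len(kept_all))
```

So which call is right depends on whether your copy of challenge_dataset.txt has the x,y line at the top.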

LovelyBuggies commented 5 years ago

LR - challenge_dataset.txt

import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt

#read data
dataframe = pd.read_csv('challenge_dataset.txt', sep=',', header=None, names=['X', 'Y'])
x_values = dataframe[['X']]
y_values = dataframe[['Y']]

#train model on data
body_reg = linear_model.LinearRegression()
body_reg.fit(x_values, y_values)

#visualize results
plt.scatter(x_values, y_values)
plt.plot(x_values, body_reg.predict(x_values))
#display the figure (needed when running as a script)
plt.show()