h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

parse fails with 9000 cols. could be total columnNames length related, not strictly column #? or could be columns. no stack trace. #13096

Closed exalate-issue-sync[bot] closed 1 year ago

exalate-issue-sync[bot] commented 1 year ago

I think I'm doing things right

test:

cd h2o-dev/py2/testdir_single_jvm
python test_parse_100k_cols.py
(it doesn't do 100k now, dialed back)

to run with your idea or an existing java -jar:
python test_parse_100k_cols.py -uc

or if you need to specify the port:
python test_parse_100k_cols.py -uc localhost:54321

error: [Errno 104] Connection reset by peer

I attached the h2o stdout and the commands.log. I also attached the last dataset, which makes it fail.

I think columnNames being passed from Parse setup to Parse is too big?? Dunno. Is it supposed to post somehow? I copied the Parse and setup from Ray. I modified some stuff, but it should basically be the same concept (see def parse( in h2o-dev/py2/h2o_ray.py).

When running I print a yellow message when I stop printing parameters to the screen because there are too many colnames. Look at sandbox/commands.log if you want to see what h2o gets.

Now I could be doing something wrong, but I think I'm doing Parse setup and Parse correctly (because I started with Ray's def parse() and modified it, and it works fine for lots of things), so I think it's the colname length.

I took my parse test for 100k cols that runs on h2o. It creates small synthetic datasets.

It failed with 100k cols. I thought it might be due to the way the colNames are captured from the parse setup and passed to parse.

So I changed the test (it iterates through a list) to do 1000, 2000, 4000, 8000, 9000, 10000 cols to see where it passed/failed. It passed up to 8000 cols and fails at 9000. It gets a connection error at h2o.

It's easy to modify the test to do the 9000 case first (see tryList in test)
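A rough sketch of what such a test loop could look like, for anyone who wants to reproduce this without the test harness. The function name and `try_list` here are hypothetical stand-ins, not the code in test_parse_100k_cols.py. It writes no header row, on the assumption that the C1..Cn names in the logged request are defaults filled in by parse setup.

```python
import csv
import io
import random


def write_syn_dataset(out, rows, cols, seed=0):
    """Write `rows` lines of `cols` small random ints as CSV, with no header row."""
    rng = random.Random(seed)
    writer = csv.writer(out, lineterminator="\n")
    for _ in range(rows):
        writer.writerow([rng.randint(0, 9) for _ in range(cols)])


# Hypothetical stand-in for the test's tryList: (rows, cols) cases,
# with the failing 9000-column case moved to the front.
try_list = [(10, 9000), (10, 1000), (10, 2000), (10, 4000), (10, 8000), (10, 10000)]

for rows, cols in try_list:
    buf = io.StringIO()  # in-memory here; the real test writes files under sandbox/
    write_syn_dataset(buf, rows, cols)
    first_line = buf.getvalue().split("\n", 1)[0]
    print("%dx%d: %d fields per row" % (rows, cols, len(first_line.split(","))))
```

The datasets stay tiny (10 rows), so any failure depends on the column count and the request built from it, not on data volume.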

I don't do the Frame on the result till the parse is done, so it's not hitting the known NPE case (and 1000 thru 8000 work). I have the row/col check if'ed out of the test for now.

it's a 14G heap

2014-11-12 00:53:56.391024 -- java -Xms14G -Xmx14G -ea -jar /home/kevin/h2o-dev/build/h2o.jar -port 54321 -ice_root sandbox/ice.l5ruLP -name pytest-kevin-8552 #PID 8558, stdout local-h2o-0.stdout.qgXt7p.log, stderr local-h2o-0.stderr.VNuoxD.log

The commands to h2o-dev for 1000 cols look like this:

2014-11-12 00:53:57.112710 -- Start http://192.168.0.34:54321/2/ImportFiles.json?path=/home/kevin/h2o-dev/py2/testdir_single_jvm/sandbox/syn_datasets

2014-11-12 00:53:57.185158 -- Start http://192.168.0.34:54321/ParseSetup.json?srcs=[nfs://home/kevin/h2o-dev/py2/testdir_single_jvm/sandbox/syn_datasets/syn_6141194274908745363_10x1000.csv]

the actual parse request

2014-11-12 00:53:57.349976 -- Start http://192.168.0.34:54321/Parse.json?srcs=[nfs://home/kevin/h2o-dev/py2/testdir_single_jvm/sandbox/syn_datasets/syn_6141194274908745363_10x1000.csv]&checkHeader=-1&ncols=1000&sep=44&columnNames=[C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14,C15,C16,C17,C18,C19,C20,C21,C22,C23,C24,C25,C26,C27,C28,C29,C30,C31,C32,C33,C34,C35,C36,C37,C38,C39,C40,C41,C42,C43,C44,C45,C46,C47,C48,C49,C50,C51,C52,C53,C54,C55,C56,C57,C58,C59,C60,C61,C62,C63,C64,C65,C66,C67,C68,C69,C70,C71,C72,C73,C74,C75,C76,C77,C78,C79,C80,C81,C82,C83,C84,C85,C86,C87,C88,C89,C90,C91,C92,C93,C94,C95,C96,C97,C98,C99,C100,C101,C102,C103,C104,C105,C106,C107,C108,C109,C110,C111,C112,C113,C114,C115,C116,C117,C118,C119,C120,C121,C122,C123,C124,C125,C126,C127,C128,C129,C130,C131,C132,C133,C134,C135,C136,C137,C138,C139,C140,C141,C142,C143,C144,C145,C146,C147,C148,C149,C150,C151,C152,C153,C154,C155,C156,C157,C158,C159,C160,C161,C162,C163,C164,C165,C166,C167,C168,C169,C170,C171,C172,C173,C174,C175,C176,C177,C178,C179,C180,C181,C182,C183,C184,C185,C186,C187,C188,C189,C190,C191,C192,C193,C194,C195,C196,C197,C198,C199,C200,C201,C202,C203,C204,C205,C206,C207,C208,C209,C210,C211,C212,C213,C214,C215,C216,C217,C218,C219,C220,C221,C222,C223,C224,C225,C226,C227,C228,C229,C230,C231,C232,C233,C234,C235,C236,C237,C238,C239,C240,C241,C242,C243,C244,C245,C246,C247,C248,C249,C250,C251,C252,C253,C254,C255,C256,C257,C258,C259,C260,C261,C262,C263,C264,C265,C266,C267,C268,C269,C270,C271,C272,C273,C274,C275,C276,C277,C278,C279,C280,C281,C282,C283,C284,C285,C286,C287,C288,C289,C290,C291,C292,C293,C294,C295,C296,C297,C298,C299,C300,C301,C302,C303,C304,C305,C306,C307,C308,C309,C310,C311,C312,C313,C314,C315,C316,C317,C318,C319,C320,C321,C322,C323,C324,C325,C326,C327,C328,C329,C330,C331,C332,C333,C334,C335,C336,C337,C338,C339,C340,C341,C342,C343,C344,C345,C346,C347,C348,C349,C350,C351,C352,C353,C354,C355,C356,C357,C358,C359,C360,C361,C362,C363,C364,C365,C366,C367,C368,C369,C370,C371,C372,C373,C374,C375,C3
76,C377,C378,C379,C380,C381,C382,C383,C384,C385,C386,C387,C388,C389,C390,C391,C392,C393,C394,C395,C396,C397,C398,C399,C400,C401,C402,C403,C404,C405,C406,C407,C408,C409,C410,C411,C412,C413,C414,C415,C416,C417,C418,C419,C420,C421,C422,C423,C424,C425,C426,C427,C428,C429,C430,C431,C432,C433,C434,C435,C436,C437,C438,C439,C440,C441,C442,C443,C444,C445,C446,C447,C448,C449,C450,C451,C452,C453,C454,C455,C456,C457,C458,C459,C460,C461,C462,C463,C464,C465,C466,C467,C468,C469,C470,C471,C472,C473,C474,C475,C476,C477,C478,C479,C480,C481,C482,C483,C484,C485,C486,C487,C488,C489,C490,C491,C492,C493,C494,C495,C496,C497,C498,C499,C500,C501,C502,C503,C504,C505,C506,C507,C508,C509,C510,C511,C512,C513,C514,C515,C516,C517,C518,C519,C520,C521,C522,C523,C524,C525,C526,C527,C528,C529,C530,C531,C532,C533,C534,C535,C536,C537,C538,C539,C540,C541,C542,C543,C544,C545,C546,C547,C548,C549,C550,C551,C552,C553,C554,C555,C556,C557,C558,C559,C560,C561,C562,C563,C564,C565,C566,C567,C568,C569,C570,C571,C572,C573,C574,C575,C576,C577,C578,C579,C580,C581,C582,C583,C584,C585,C586,C587,C588,C589,C590,C591,C592,C593,C594,C595,C596,C597,C598,C599,C600,C601,C602,C603,C604,C605,C606,C607,C608,C609,C610,C611,C612,C613,C614,C615,C616,C617,C618,C619,C620,C621,C622,C623,C624,C625,C626,C627,C628,C629,C630,C631,C632,C633,C634,C635,C636,C637,C638,C639,C640,C641,C642,C643,C644,C645,C646,C647,C648,C649,C650,C651,C652,C653,C654,C655,C656,C657,C658,C659,C660,C661,C662,C663,C664,C665,C666,C667,C668,C669,C670,C671,C672,C673,C674,C675,C676,C677,C678,C679,C680,C681,C682,C683,C684,C685,C686,C687,C688,C689,C690,C691,C692,C693,C694,C695,C696,C697,C698,C699,C700,C701,C702,C703,C704,C705,C706,C707,C708,C709,C710,C711,C712,C713,C714,C715,C716,C717,C718,C719,C720,C721,C722,C723,C724,C725,C726,C727,C728,C729,C730,C731,C732,C733,C734,C735,C736,C737,C738,C739,C740,C741,C742,C743,C744,C745,C746,C747,C748,C749,C750,C751,C752,C753,C754,C755,C756,C757,C758,C759,C760,C761,C762,C763,C764,C765,C766,C767,C768,C769,C770,C771,C772,C773,C774,C775,C7
76,C777,C778,C779,C780,C781,C782,C783,C784,C785,C786,C787,C788,C789,C790,C791,C792,C793,C794,C795,C796,C797,C798,C799,C800,C801,C802,C803,C804,C805,C806,C807,C808,C809,C810,C811,C812,C813,C814,C815,C816,C817,C818,C819,C820,C821,C822,C823,C824,C825,C826,C827,C828,C829,C830,C831,C832,C833,C834,C835,C836,C837,C838,C839,C840,C841,C842,C843,C844,C845,C846,C847,C848,C849,C850,C851,C852,C853,C854,C855,C856,C857,C858,C859,C860,C861,C862,C863,C864,C865,C866,C867,C868,C869,C870,C871,C872,C873,C874,C875,C876,C877,C878,C879,C880,C881,C882,C883,C884,C885,C886,C887,C888,C889,C890,C891,C892,C893,C894,C895,C896,C897,C898,C899,C900,C901,C902,C903,C904,C905,C906,C907,C908,C909,C910,C911,C912,C913,C914,C915,C916,C917,C918,C919,C920,C921,C922,C923,C924,C925,C926,C927,C928,C929,C930,C931,C932,C933,C934,C935,C936,C937,C938,C939,C940,C941,C942,C943,C944,C945,C946,C947,C948,C949,C950,C951,C952,C953,C954,C955,C956,C957,C958,C959,C960,C961,C962,C963,C964,C965,C966,C967,C968,C969,C970,C971,C972,C973,C974,C975,C976,C977,C978,C979,C980,C981,C982,C983,C984,C985,C986,C987,C988,C989,C990,C991,C992,C993,C994,C995,C996,C997,C998,C999,C1000]&singleQuotes=False&hex=syn_6141194274908745363_10x1000.hex&pType=CSV

polling

2014-11-12 00:53:57.415525 -- Start http://192.168.0.34:54321/2/Jobs.json/$0301c0a8002232d4ffffffff$_8faffa8c25529997a7e55d2acfab446b?
2014-11-12 00:53:57.935475 -- Start http://192.168.0.34:54321/2/Jobs.json/$0301c0a8002232d4ffffffff$_8faffa8c25529997a7e55d2acfab446b?
2014-11-12 00:53:58.440342 -- Start http://192.168.0.34:54321/2/Jobs.json/$0301c0a8002232d4ffffffff$_8faffa8c25529997a7e55d2acfab446b?
2014-11-12 00:53:58.944387 -- Start http://192.168.0.34:54321/2/Jobs.json/$0301c0a8002232d4ffffffff$_8faffa8c25529997a7e55d2acfab446b?

2014-11-12 00:53:58.947789 -- Start http://192.168.0.34:54321/3/Frames.json/syn_6141194274908745363_10x1000.hex?find_compatible_models=0&len=100&offset=0

exalate-issue-sync[bot] commented 1 year ago

Kevin Normoyle commented: It also fails with 1900 cols if the col headers are strings of length 30. There is a comma separating them in the parameters, plus the leading/trailing [].

so it seems to be total string length or some buffer size somewhere. Something around 58k bytes or so. maybe something is only 16 bits.
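To make the size estimate concrete, here is a small sketch that computes the raw length of the columnNames value for the cases mentioned in this thread. The helper name is made up for illustration. It counts unencoded characters only; if the client percent-encodes the commas and brackets, the URL on the wire would be longer.

```python
def column_names_param(ncols, name_len=None):
    """Build the columnNames value as it appears in the logged request.

    With name_len=None the names are C1..C<ncols>; otherwise every name is a
    placeholder string of exactly name_len characters.
    """
    if name_len is None:
        names = ["C%d" % i for i in range(1, ncols + 1)]
    else:
        names = ["x" * name_len for _ in range(ncols)]
    return "[" + ",".join(names) + "]"


for n in (1000, 8000, 9000, 10000):
    print("%5d default names: %d chars" % (n, len(column_names_param(n))))

print(" 1900 names of length 30: %d chars" % len(column_names_param(1900, name_len=30)))
```

The 1900-column case with 30-character names comes to 58901 characters, which lines up with the "around 58k bytes" figure, while the 9000-column case with default names is 52894 characters before the rest of the URL is added.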

exalate-issue-sync[bot] commented 1 year ago

Kevin Normoyle commented: I notice from the python stack trace that it might be iterating over multi-part form data or something. I wonder if the CONTENT_CHUNK_SIZE limits a single part, and I'm transitioning to multi-part and maybe h2o doesn't tolerate that? I'll try to see if I can get some header info

this got me wondering:
  File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 694, in content
    self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
  File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 627, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/response.py", line 240, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/response.py", line 187, in read
    data = self._fp.read(amt)
  File "/usr/lib/python2.7/httplib.py", line 561, in read
    s = self.fp.read(amt)
  File "/usr/lib/python2.7/socket.py", line 380, in read

exalate-issue-sync[bot] commented 1 year ago

Kevin Normoyle commented: Oh I see, the parse is doing a 'get' not a 'post'

Is it supposed to do a post? If we can have long parameters everywhere, does everything need to be a 'post'?

I'm surprised it worked up to 9000 cols. Can a get pass data?

you can see my parse does a get:
  File "../h2o_import.py", line 416, in import_parse
    benchmarkLogging, noPoll, **kwargs)
  File "../h2o_import.py", line 384, in parse_only
    benchmarkLogging=benchmarkLogging, noPoll=noPoll, **kwargs)
  File "../h2o_ray.py", line 189, in parse
    parse_result = self.do_json_request(jsonRequest="Parse.json", params=parse_params, timeout=timeoutSecs)
  File "../h2o_objects.py", line 214, in do_json_request
    r = requests.get(url, timeout=timeout, params=params, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 55, in get
    return request('get', url, **kwargs)

Ray: looking at your h2o.py for the parse (which I copied), I don't see a post cmd here:
parse_result = self.__do_json_request(jsonRequest="Parse.json", timeout=timeoutSecs, params=parse_params, **kwargs)
and the default in __do_json_request is "get".

I notice ModelBuilders passes parameters with postData, and cmd='post' .. maybe parse needs to also?

I tried doing a post but h2o didn't like that for parse

JSON call returned non-200 status with http://192.168.0.34:54321/Parse.json r.status_code: 404

This is the current get:
parse_result = self.do_json_request(jsonRequest="Parse.json", params=parse_params, timeout=timeoutSecs)

This is what I tried to post:
parse_result = self.do_json_request(jsonRequest="Parse.json", timeout=timeoutSecs, cmd='post', postData=parse_params)

exalate-issue-sync[bot] commented 1 year ago

Raymond Peck commented: Yeah, when Cliff wrote it we didn't have any POST examples yet, so it uses GET. See the route in RequestServer, here:

addToNavbar(register("/ParseSetup" ,"GET",ParseSetupHandler .class,"guessSetup"  ,"Guess the parameters for parsing raw byte-oriented data into an H2O Frame."),"/ParseSetup","ParseSetup",    "Data");
addToNavbar(register("/Parse"      ,"GET",ParseHandler      .class,"parse"       ,"Parse a raw byte-oriented Frame into a useful columnar data Frame."),"/Parse"      , "Parse",         "Data");

I'll get it working using POST and see how far I get.

exalate-issue-sync[bot] commented 1 year ago

Raymond Peck commented: Fixed. It was the GET URL limitation.

I've added POST routes for Parse, ParseData and Rapids, and updated the Python bindings. 100k columns now works fine.
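For readers wondering why switching the verb matters: with GET every parameter is packed into the URL, whereas with POST the same parameters can travel in the request body and the URL stays short. The sketch below builds both requests with the Python 3 standard library, without sending anything. The host, port, and parameter set are assumptions for illustration, and this is not the code in the updated Python bindings.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint for illustration; adjust host/port to your own H2O instance.
BASE = "http://localhost:54321/Parse.json"


def build_get(params):
    """GET: the parameters are percent-encoded into the URL itself."""
    return Request(BASE + "?" + urlencode(params), method="GET")


def build_post(params):
    """POST: same parameters, form-encoded into the body; the URL stays short."""
    body = urlencode(params).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(BASE, data=body, headers=headers, method="POST")


# The 9000-column case from this issue, with default C1..C9000 names.
params = {
    "ncols": 9000,
    "columnNames": "[" + ",".join("C%d" % i for i in range(1, 9001)) + "]",
}

get_req = build_get(params)
post_req = build_post(params)

print("GET  url length:", len(get_req.full_url))
print("POST url length:", len(post_req.full_url), "body length:", len(post_req.data))
```

Nothing here says which limit the server actually hit; it only shows that the GET URL grows with the number of column names while the POST URL does not.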

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-82
Assignee: Raymond Peck
Reporter: Kevin Normoyle
State: Resolved
Fix Version: N/A
Attachments: Available (Count: 3)
Development PRs: N/A

Attachments From Jira

Attachment Name: commands.log Attached By: Kevin Normoyle File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-82/commands.log

Attachment Name: local-h2o-0.stdout.Kjq9vg.log Attached By: Kevin Normoyle File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-82/local-h2o-0.stdout.Kjq9vg.log

Attachment Name: syn_8318134903216933205_10x9000.csv Attached By: Kevin Normoyle File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-82/syn_8318134903216933205_10x9000.csv